Abstract

Single-valued point forecasts continue to be issued and used in almost all realms of
science and society. Typically, competing point forecasters or forecasting procedures 
are compared and assessed by means of an error measure or scoring function, such as 
the absolute error or the squared error, that depends both on the point forecast and the 
realizing observation. The individual scores are then averaged over forecast cases, to 
result in a summary measure of the predictive performance, such as the mean absolute 
error or the (root) mean squared error. I demonstrate that this common practice can 
lead to grossly misguided inferences, unless the scoring function and the forecasting 
task are carefully matched. 

Effective point forecasting requires that the scoring function be specified a priori, 
or that the forecaster receives a directive in the form of a statistical functional, such 
as the mean or a quantile of the predictive distribution. If the scoring function is 
specified a priori, the forecaster can issue an optimal point forecast, namely, the Bayes 
rule, which minimizes the expected loss under the forecaster's predictive distribution. 
If the forecaster receives a directive in the form of a functional, it is critical that the 
scoring function be consistent for it, in the sense that the expected score is minimized 
when following the directive. Any consistent scoring function induces a proper scoring 
rule for probabilistic forecasts, and a duality principle links Bayes rules and consistent 
scoring functions. 

A functional is elicitable if there exists a scoring function that is strictly consistent 
for it. Expectations, ratios of expectations and quantiles are elicitable. For example, 
a scoring function is consistent for the mean functional if and only if it is a Bregman 
function. It is consistent for a quantile if and only if it is generalized piecewise linear. 
Similar characterizations apply to ratios of expectations and to expectiles. Weighted 
scoring functions are consistent for functionals that adapt to the weighting in peculiar 
ways. Not all functionals are elicitable; for instance, conditional value-at-risk is not, 
despite its popularity in quantitative finance. 

Key words and phrases: Bayes rule; Bregman function; conditional value-at-risk 
(CVaR); consistency; decision theory; elicitability; expectile; mean; median; mode; 
optimal point forecast; piecewise linear; proper scoring rule; quantile; statistical functional

1 Introduction



In many aspects of human activity, a major desire is to make forecasts for an uncertain future. 
Consequently, forecasts ought to be probabilistic in nature, taking the form of probability 
distributions over future quantities or events (Dawid 1984; Gneiting 2008a). Still, many 
practical situations require single-valued point forecasts, for reasons of decision making, 
market mechanisms, reporting requirements, communications, or tradition, among others. 

1.1 Using scoring functions to evaluate point forecasts 

In this type of situation, competing point forecasters or forecasting procedures are compared 
and assessed by means of an error measure, such as the absolute error or the squared error, 
which is averaged over forecast cases. Thus, the performance criterion takes the form 

$$\bar{S} = \frac{1}{n} \sum_{i=1}^{n} S(x_i, y_i), \tag{1}$$

where there are $n$ forecast cases with corresponding point forecasts, $x_1, \ldots, x_n$, and verifying
observations, $y_1, \ldots, y_n$. The function $S$ depends both on the forecast and the realization,
and we refer to it as a scoring function.

Table 1 lists some commonly used scoring functions. We generally take scoring functions to
be negatively oriented, that is, the smaller, the better. The absolute error and the squared
error are of the prediction error form, in that they depend on the forecast error, $x - y$, only,
and they are symmetric, in that $S(x, y) = S(y, x)$. The absolute percentage error and the
relative error are used for strictly positive quantities only; they are neither of the prediction 
error form nor symmetric. Patton (2009) discusses these as well as many other scoring 
functions that have been used to assess point forecasts for a strictly positive quantity, such 
as an asset value or a volatility proxy. 
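The mean score (1) is straightforward to compute. The following minimal sketch assumes the usual definitions of the four scoring functions discussed here, namely $(x - y)^2$ for the squared error, $|x - y|$ for the absolute error, $|(x - y)/y|$ for the absolute percentage error and $|(x - y)/x|$ for the relative error. The function names are illustrative only.

```python
def squared_error(x, y):
    """Squared error (SE): prediction error form, symmetric."""
    return (x - y) ** 2


def absolute_error(x, y):
    """Absolute error (AE): prediction error form, symmetric."""
    return abs(x - y)


def absolute_percentage_error(x, y):
    """Absolute percentage error (APE): requires y > 0, not symmetric."""
    return abs((x - y) / y)


def relative_error(x, y):
    """Relative error (RE): requires x > 0, not symmetric."""
    return abs((x - y) / x)


def mean_score(score, forecasts, observations):
    """Average a scoring function over forecast cases, as in (1)."""
    if len(forecasts) != len(observations):
        raise ValueError("forecasts and observations must have equal length")
    total = sum(score(x, y) for x, y in zip(forecasts, observations))
    return total / len(forecasts)
```

For instance, `mean_score(squared_error, [1.0, 2.0], [2.0, 4.0])` averages the two squared errors 1 and 4. Swapping the arguments leaves the squared and absolute errors unchanged but alters the absolute percentage error and the relative error, which is the asymmetry noted above.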



Table 1: Some commonly used scoring functions.

Our next two tables summarize the use of scoring functions in academia, the public and the
private sector. Table 2 surveys the 2008 volumes of peer-reviewed journals in forecasting
(Group I) and statistics (Group II), along with premier journals in the most prominent
application areas, namely econometrics (Group III) and meteorology (Group IV). We call an
article a forecasting paper if it contains a table or a figure in which the predictive performance
of a forecaster or forecasting method is summarized in the form of the mean score (1), or
a monotone transformation thereof, such as the root mean squared error. Not surprisingly,
the majority of the Group I papers are forecasting papers, and many of them employ several
scoring functions simultaneously. Overall, the squared error is the most popular scoring
function in academia, particularly in Groups III and IV, followed by the absolute error and
the absolute percentage error.

Table 3 reports the use of scoring functions in businesses and organizations, according to
surveys conducted or summarized by Carbone and Armstrong (1982), Mentzner and Kahn 
(1995), McCarthy et al. (2006) and Fildes and Goodwin (2007). In addition to the squared 
error and the absolute error, the absolute percentage error has been very widely used in 
practice, presumably because business forecasts focus on demand, sales, or costs, all of 
which are nonnegative quantities. 

There are many options and considerations in choosing a scoring function. What scoring 
function ought to be used in practice? Do the standard choices have theoretical support? 
Arguably, there is considerable contention in the scientific community, along with a critical 
need for theoretically principled guidance. Some 20 years ago, Murphy and Winkler (1987, 
p. 1330) commented on the state of the art in forecast evaluation, noting that 

"[...] verification measures have tended to proliferate, with relatively little effort being 
made to develop general concepts and principles [. . . ] This state of affairs has impacted 
the development of a science of forecast verification." 

Nothing much has changed since. Armstrong (2001) called for further research, while 
Moskaitis and Hansen (2006) asked 

"Deterministic forecasting and verification: A busted system?" 

Similarly, the recent review by Fildes et al. (2008, p. 1158) states that 

"Defining the basic requirements of a good error measure is still a controversial issue." 



1.2 Simulation study 

To focus issues and ideas, we consider a simulation study, in which we seek point forecasts
for a highly volatile daily asset value, $y_t$. The data generating process is such that $y_t$ is a
realization of the random variable

$$Y_t = Z_t^2, \tag{2}$$

where $Z_t$ follows a Gaussian conditionally heteroscedastic time series model (Engle 1982;
Bollerslev 1986), with the parameter values proposed by Christoffersen and Diebold (1996),
in that

$$Z_t \sim \mathcal{N}(0, \sigma_t^2) \quad \text{where} \quad \sigma_t^2 = 0.20 \, Z_{t-1}^2 + 0.75 \, \sigma_{t-1}^2 + 0.05.$$

Table 2: Use of scoring functions in the 2008 volumes of leading peer-reviewed journals
in forecasting (Group I), statistics (Group II), econometrics (Group III) and meteorology
(Group IV). Column 2 shows the total number of papers published in 2008 under Web of
Science document type article, note or review. Column 3 shows the number of forecasting
papers (FP), that is, the number of articles with a table or figure that summarizes predictive
performance in the form of the mean score (1) or a monotone transformation thereof.
Columns 4 through 7 show the number of papers employing the squared error (SE), absolute
error (AE), absolute percentage error (APE), or miscellaneous (MSC) other scoring functions.
The sum of columns 4 through 7 may exceed the number in column 3, because of
the simultaneous use of multiple scoring functions in some articles. Papers that apply error
measures to evaluate estimation methods, rather than forecasting methods, have not been
considered in this study.





| Journal | Total | FP | SE | AE | APE | MSC |
|---|---|---|---|---|---|---|
| Group I: Forecasting | | | | | | |
| International Journal of Forecasting | 41 | 32 | 21 | 10 | 8 | 4 |
| Journal of Forecasting | 39 | 25 | 23 | 13 | 5 | 3 |
| Group II: Statistics | | | | | | |
| Annals of Applied Statistics | 62 | 8 | 6 | 3 | 1 | |
| Annals of Statistics | 100 | 5 | 3 | 2 | | |
| Journal of the American Statistical Association | 129 | 10 | 9 | 1 | | |
| Journal of the Royal Statistical Society Ser. B | 49 | 5 | 4 | 1 | | |
| Group III: Econometrics | | | | | | |
| Journal of Business and Economic Statistics | 26 | 9 | 8 | 2 | 1 | |
| Journal of Econometrics | 118 | 5 | 5 | | | |
| Group IV: Meteorology | | | | | | |
| Bulletin of the American Meteorological Society | 73 | 1 | 1 | | | |
| Monthly Weather Review | 300 | 63 | 58 | 8 | 2 | |
| Quarterly Journal of the Royal Meteorological Society | 148 | 19 | 19 | | | |
| Weather and Forecasting | 79 | 26 | 20 | 11 | | |

Table 3: Use of scoring functions in the evaluation of point forecasts in businesses and
organizations. Columns 2 through 4 show the percentage of survey respondents using the
squared error (SE), absolute error (AE) and absolute percentage error (APE), with the
source of the survey listed in column 1.

| Source | SE | AE | APE |
|---|---|---|---|
| Carbone and Armstrong (1982), Table 1 | 27% | 19% | 9% |
| Mentzner and Kahn (1995), Table VIII | 10% | 25% | 52% |
| McCarthy, Davis, Golicic and Mentzner (2006), Table VIII | 6% | 20% | 45% |
| Fildes and Goodwin (2007), Table 5 | 9% | 36% | 44% |

Table 4: The mean error measure (1) for the three point forecasters in the simulation study,
using the squared error (SE), absolute error (AE), absolute percentage error (APE) and
relative error (RE) scoring functions.

| Forecaster | SE | AE | APE | RE |
|---|---|---|---|---|
| Statistician | 5.07 | 0.97 | 2.58 × 10^5 | 0.97 |
| Optimist | 22.73 | 4.35 | 13.96 × 10^5 | 0.87 |
| Pessimist | 7.61 | 0.96 | 0.14 × 10^5 | 19.24 |

We consider three forecasters, each of whom issues a one-day ahead point forecast for the
asset value. The statistician has knowledge of the data generating process and the actual
value of the conditional variance $\sigma_t^2$, and thus predicts the true conditional mean,

$$\hat{x}_t = E(Y_t \mid \sigma_t^2) = \sigma_t^2,$$

as her point forecast. The optimist always predicts $\hat{x}_t = 5$. The pessimist always issues the
point forecast $\hat{x}_t = 0.05$. Figure 1 shows these point forecasts along with the realizing asset
value for 200 successive trading days. There ought to be little contention as to the predictive
performance, in that the statistician is more skilled than the optimist or the pessimist.

Table 4 provides a formal evaluation of the three forecasters for a sequence of $n = 100{,}000$
sequential forecasts, using the mean score (1) and the scoring functions listed in Table 1.
The results are counterintuitive and disconcerting, in that the pessimist has the best (lowest)
score both under the absolute error and the absolute percentage error scoring functions. In
terms of relative error, the optimist performs best. Yet, what we have done here is common
practice in academia and businesses, in that point forecasts are evaluated by means of these
scoring functions.

Figure 1: A realized series of volatile daily asset prices under the data generating process
(2), shown by circles, along with the one-day ahead point forecasts by the statistician (blue
line), the optimist (orange line at top) and the pessimist (red line at bottom).
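The simulation study can be sketched in a few lines. The code below is an illustrative reconstruction under stated assumptions, not the original implementation: the seed, the burn-in length, the starting value of the conditional variance and all function names are hypothetical choices, and the four scoring functions use their usual definitions.

```python
import math
import random

# Usual definitions of the four scoring functions (x = forecast, y = observation).
SCORES = {
    "SE": lambda x, y: (x - y) ** 2,
    "AE": lambda x, y: abs(x - y),
    "APE": lambda x, y: abs((x - y) / y),
    "RE": lambda x, y: abs((x - y) / x),
}


def simulate(n, seed=1, burn_in=1000):
    """Simulate conditional variances sigma_t^2 and values y_t = z_t^2 from (2)."""
    rng = random.Random(seed)
    sigma2 = 1.0  # assumed start: unconditional variance 0.05 / (1 - 0.20 - 0.75)
    z = 0.0
    variances, values = [], []
    for t in range(n + burn_in):
        sigma2 = 0.20 * z * z + 0.75 * sigma2 + 0.05
        z = rng.gauss(0.0, math.sqrt(sigma2))
        if t >= burn_in:  # discard the burn-in period
            variances.append(sigma2)
            values.append(z * z)
    return variances, values


def mean_scores(forecasts, values):
    """Mean score (1) under each scoring function."""
    n = len(values)
    return {
        name: sum(score(x, y) for x, y in zip(forecasts, values)) / n
        for name, score in SCORES.items()
    }


def run_study(n=100_000, seed=1):
    """Evaluate the statistician, the optimist and the pessimist."""
    variances, values = simulate(n, seed)
    forecasters = {
        "Statistician": variances,  # true conditional mean sigma_t^2
        "Optimist": [5.0] * n,
        "Pessimist": [0.05] * n,
    }
    return {name: mean_scores(f, values) for name, f in forecasters.items()}


if __name__ == "__main__":
    for name, scores in run_study().items():
        print(name, {key: round(value, 2) for key, value in scores.items()})
```

The exact numbers depend on the seed and, given the heavy tails of the process, can vary noticeably between runs. The qualitative pattern of Table 4 should nevertheless be visible, with the pessimist scoring best under the absolute percentage error and the optimist beating the pessimist by a wide margin under the relative error.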

1.3 Discussion 

The source of these disconcerting results is aptly explained in a recent paper by Engelberg, 
Manski and Williams (2009, p. 30): 

"Our concern is prediction of real-valued outcomes such as firm profit, GDP, growth, 
or temperature. In these cases, the users of point predictions sometimes presume 
that forecasters report the means of their subjective probability distributions; that is, 
their best point predictions under square loss. However, forecasters are not specifically 
asked to report subjective means. Nor are they asked to report subjective medians 
or modes, which are best predictors under other loss functions. Instead, they are 
simply asked to 'predict' the outcome or to provide their 'best prediction', without 
definition of the word 'best.' In the absence of explicit guidance, forecasters may report 
different distributional features as their point predictions. Some may report subjective 
means, others subjective medians or modes, and still others, applying asymmetric loss 
functions, may report various quantiles of their subjective probability distributions." 

Similarly, Murphy and Daan (1985, p. 391) noted that

"It will be assumed here that the forecasters receive a 'directive' concerning the procedure
to be followed [...] and that it is desirable to choose an evaluation measure
that is consistent with this concept. An example may help to illustrate this concept.
Consider a continuous [...] predictand, and suppose that the directive states 'forecast
the expected (or mean) value of the variable.' In this situation, the mean square error
measure would be an appropriate scoring rule, since it is minimized by forecasting the
mean of the (judgemental) probability distribution. Measures that correspond with a
directive in this sense will be referred to as consistent scoring rules (for that directive)."

Despite these well-argued perspectives, there has been little recognition that the common 
practice of requesting 'some' point forecast, and then evaluating the forecasters by using 
'some' (set of) scoring function(s), is not a meaningful endeavor. In this paper, we develop 
the perspectives of Murphy and Daan (1985) and Engelberg et al. (2009) and argue that 
effective point forecasting depends on 'guidance' or 'directives', which can be given in one 
of two complementary ways, namely, by disclosing the scoring function ex ante to the 
forecaster, or by requesting a specific functional of the forecaster's predictive distribution, 
such as the mean or a quantile. 

As to the first option, the a priori disclosure of the scoring function allows the forecaster 
to tailor the point predictor to the scoring function at hand. In particular, this permits 
our statistician forecaster to mutate into Mr. Bayes, who issues the optimal point forecast, 
namely the Bayes rule, 

$$\hat{x} = \arg\min_x E_F \, S(x, Y), \tag{3}$$

where the expectation is taken with respect to the forecaster's subjective or objective predictive
distribution, $F$. For example, if the scoring function $S$ is the squared error, the optimal
point forecast is the mean of the predictive distribution. In the case of the absolute error,
the Bayes rule is any median of the predictive distribution. The class

$$S_\beta(x, y) = \left| 1 - \left( \frac{y}{x} \right)^{\beta} \right| \qquad (\beta \neq 0) \tag{4}$$

of scoring functions nests both the absolute percentage error ($\beta = -1$) and the relative error
($\beta = 1$) scoring functions. If the predictive distribution $F$ has density $f$ on the positive
half-axis and a finite fractional moment of order $\beta$, the optimal point forecast under the loss
or scoring function (4) is the median of a random variable whose density is proportional to
$y^\beta f(y)$. We call this the $\beta$-median of the probability distribution $F$ and write $\mathrm{med}^{(\beta)}(F)$.
The traditional median arises in the limit as $\beta \to 0$.
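An empirical version of the β-median is a weighted sample median with weights $y^\beta$. The sketch below is illustrative only: it assumes a standard normal $Z$ with unit conditional variance, so that $Y = Z^2$, and the sample size, the seed and the function names are hypothetical choices.

```python
import random


def beta_median(sample, beta):
    """Weighted median of the sample with weights y**beta.

    With beta = 0 all weights equal one and the ordinary sample median results.
    """
    ordered = sorted(sample)
    weights = [y ** beta for y in ordered]
    half = 0.5 * sum(weights)
    cumulative = 0.0
    for y, w in zip(ordered, weights):
        cumulative += w
        if cumulative >= half:
            return y
    return ordered[-1]


def mean_s_beta(x, sample, beta):
    """Empirical mean of the score |1 - (y / x)**beta| at the point forecast x."""
    return sum(abs(1.0 - (y / x) ** beta) for y in sample) / len(sample)


# Assumed setting: Y = Z^2 with Z standard normal (unit conditional variance).
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) ** 2 for _ in range(50_000)]

median_0 = beta_median(sample, 0.0)    # ordinary median
median_1 = beta_median(sample, 1.0)    # optimal under the relative error
median_m1 = beta_median(sample, -1.0)  # collapses towards zero (APE case)
```

Up to sampling error, `median_0` and `median_1` should be close to 0.455 and 2.366, the multiples of the conditional variance that arise for this data generating process. The β-median for β = -1 is driven by the smallest observations and shrinks towards zero as the sample grows, reflecting the infinite fractional moment of order -1.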

Table 5 summarizes our discussion, in that it shows the optimal point forecast, or Bayes
rule, under the scoring functions in Table 1, both in full generality and in the special case
of the true predictive distribution under the data generating process (2). Table 6 shows the
mean score (1) for the new competitor Mr. Bayes in the simulation study, who issues the
optimal point forecast. As expected, Mr. Bayes outperforms his colleagues.

An alternative to disclosing the scoring function is to request a specific functional of the 
forecaster's predictive distribution, such as the mean or a quantile, and to apply any scoring 
function that is consistent with the functional, roughly in the following sense. 

Let the interval $I$ be the potential range of the outcomes, such as $I = \mathbb{R}$ for a real-valued
quantity, or $I = (0, \infty)$ for a strictly positive quantity, and let the probability distribution $F$
be concentrated on $I$. Then a scoring function is any mapping $S : I \times I \to [0, \infty)$. A functional
is a potentially set-valued mapping $F \mapsto T(F) \subseteq I$. A scoring function $S$ is consistent for
the functional $T$ if

$$E_F[S(t, Y)] \le E_F[S(x, Y)]$$

for all $F$, all $t \in T(F)$ and all $x \in I$. It is strictly consistent if it is consistent and equality
of the expectations implies that $x \in T(F)$. Following Osband (1985) and Lambert, Pennock
and Shoham (2008), a functional is elicitable if there exists a scoring function that is strictly
consistent for it.

Table 5: Bayes rules under the scoring functions in Table 1 as a functional of the forecaster's
predictive distribution, $F$. The functional $\mathrm{med}^{(\beta)}(F)$ is defined in the text. The final column
specializes to the true predictive distribution under the data generating process (2) in the
simulation study. The entry for the absolute percentage error (APE) is to be understood as
follows. The predictive distribution $F$ has infinite fractional moment of order $-1$, and thus
$\mathrm{med}^{(-1)}(F)$ does not exist. However, it is readily seen that the smaller the (strictly positive)
point forecast, the smaller the expected APE. Thus, a prudent forecaster will issue some
very small $\epsilon > 0$ as point predictor.

| Scoring Function | Bayes Rule | Point Forecast in Simulation Study |
|---|---|---|
| SE | $\hat{x} = \mathrm{mean}(F)$ | $\sigma_t^2$ |
| AE | $\hat{x} = \mathrm{median}(F)$ | $0.455 \, \sigma_t^2$ |
| APE | $\hat{x} = \mathrm{med}^{(-1)}(F)$ | $\epsilon$ |
| RE | $\hat{x} = \mathrm{med}^{(1)}(F)$ | $2.366 \, \sigma_t^2$ |

Table 6: Continuation of Table 4, showing the corresponding mean scores for the new competitor,
Mr. Bayes. In the case of the APE, Mr. Bayes issues the point forecast $\hat{x} = \epsilon = 10^{-10}$.

| Forecaster | SE | AE | APE | RE |
|---|---|---|---|---|
| Mr. Bayes | 5.07 | 0.86 | 1.00 | 0.75 |

1.4 Plan of the paper 

The remainder of the paper is organized as follows. Section 2 develops the notions of consistency
and elicitability in a comprehensive way. In addition to reviewing and unifying the
extant literature, we present original results on weighted scoring functions that extend prior
findings on optimal point forecasts, such as those of Park and Stefanski (1998) and Patton
(2010). Section 3 turns to examples. The mean functional, ratios of expectations, quantiles
and expectiles are elicitable. Subject to weak regularity conditions, a scoring function for a
real-valued predictand is consistent for the mean functional if and only if it is a Bregman
function, that is, of the form

$$S(x, y) = \phi(y) - \phi(x) - \phi'(x) \, (y - x),$$

where $\phi$ is a convex function with subgradient $\phi'$ (Savage 1971). More general and novel
results apply to ratios of expectations and expectiles. A scoring function is consistent for
the $\alpha$-quantile if and only if it is generalized piecewise linear (GPL) of order $\alpha \in (0, 1)$, that
is, of the form

$$S(x, y) = (\mathbf{1}(x \ge y) - \alpha) \, (g(x) - g(y)),$$

where $\mathbf{1}(\cdot)$ denotes an indicator function and $g$ is nondecreasing (Thomson 1979; Saerens
2000). However, not all functionals are elicitable. Notably, the conditional value-at-risk
(CVaR) functional is not elicitable, despite its popularity as a risk measure in financial
applications.
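The two families can be illustrated numerically. The following sketch builds Bregman and GPL scoring functions from user-supplied $\phi$ and $g$ and evaluates them on a small made-up sample; the helper names and the sample values are hypothetical.

```python
import math


def bregman_score(phi, dphi):
    """Bregman scoring function generated by a convex phi with (sub)gradient dphi."""
    def score(x, y):
        return phi(y) - phi(x) - dphi(x) * (y - x)
    return score


def gpl_score(alpha, g):
    """Generalized piecewise linear scoring function of order alpha, g nondecreasing."""
    def score(x, y):
        indicator = 1.0 if x >= y else 0.0
        return (indicator - alpha) * (g(x) - g(y))
    return score


def mean_score(score, x, observations):
    """Mean score of the constant point forecast x over the observations."""
    return sum(score(x, y) for y in observations) / len(observations)


observations = [0.5, 1.0, 1.5, 4.0, 9.0]                   # made-up sample
sample_mean = sum(observations) / len(observations)        # 3.2
sample_median = sorted(observations)[len(observations) // 2]  # 1.5

squared = bregman_score(lambda u: u * u, lambda u: 2.0 * u)  # phi(u) = u^2 gives (x - y)^2
exponential = bregman_score(math.exp, math.exp)              # phi(u) = exp(u)
pinball = gpl_score(0.5, lambda u: u)                        # g(u) = u, half the absolute error
log_gpl = gpl_score(0.5, math.log)                           # g(u) = log(u), positive values only
```

On this sample, both Bregman scores attain their smallest mean at the sample mean, and both GPL scores of order 1/2 attain theirs at the sample median, which can be checked by comparing `mean_score` at these points against nearby forecasts.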

The paper closes with a discussion in Section 5, which makes a plea for change in the practice
of point forecasting. I contend that in issuing and evaluating point forecasts, it is essential 
that either the scoring function be specified ex ante, or an elicitable target functional be 
named, such as an expectation or a quantile, and scoring functions be used that are consistent 
for the target functional. 

2 A decision-theoretic approach to the evaluation of 
point forecasts 

We now develop a theoretical framework for the evaluation of point forecasts. Towards this 
end, we review the more general, classical decision-theoretic setting whose basic ingredients 
are as follows. 

(a) An observation domain, $O$, which comprises the potential outcomes of a future observation.

(b) A class $\mathcal{F}$ of probability measures on the observation domain $O$ (equipped with a
suitable $\sigma$-algebra), which constitutes a family of probability distributions for the future
observation.

(c) An action domain, $A$, which comprises the potential actions of a decision maker.

(d) A loss function $L : A \times O \to [0, \infty)$, where $L(a, o)$ represents the monetary or societal
cost when the decision maker takes the action $a \in A$ and the observation $o \in O$
materializes.

Given a probability distribution $F \in \mathcal{F}$ for the future observation, the Bayes act or Bayes
rule is any decision $\hat{a} \in A$ such that

$$\hat{a} = \arg\min_a E_F \, L(a, Y), \tag{5}$$

where $Y$ is a random variable with distribution $F$. Thus, if the decision maker's assessment of
the uncertain future is represented by the probability measure $F$, and she wishes to minimize
the expected loss, her optimal decision is the Bayes act, $\hat{a}$. In general, Bayes acts need not
exist nor be unique, but in most cases of practical interest, Bayes rules exist, and frequently
they are unique (Ferguson 1967).

2.1 Decision-theoretic setting 

Point forecasting falls into the general decision-theoretic setting, if we assume that the observation
domain and the action domain coincide. In what follows we assume, for simplicity,
that this common domain,

$$D = O = A \subseteq \mathbb{R}^d,$$

is a subset of the Euclidean space $\mathbb{R}^d$ and equipped with the corresponding Borel $\sigma$-algebra.
Furthermore, we refer to the loss function as a scoring function. With these adaptations,
the basic components of our decision-theoretic framework are as follows.

(a) A prediction-observation (PO) domain, $\mathcal{D} = D \times D$, which is the Cartesian product of
the domain $D \subseteq \mathbb{R}^d$ with itself.

(b) A family $\mathcal{F}$ of potential probability distributions for the future observation $Y$ that
takes values in $D$.

(c) A scoring function $S : \mathcal{D} = D \times D \to [0, \infty)$, where $S(x, y)$ represents the loss or penalty
when the point forecast $x \in D$ is issued and the observation $y \in D$ materializes.

In this setting, the optimal point forecast under the probability distribution $F \in \mathcal{F}$ for the
future observation, $Y$, is the Bayes act or Bayes rule (5), which can now be written as

$$\hat{x} = \arg\min_x E_F \, S(x, Y). \tag{6}$$

We will mostly work in dimension $d = 1$, in which any connected domain $D$ is simply an
interval, $I$. The cases of prime interest then are the real line, $I = \mathbb{R}$, and the nonnegative or
positive half-axis, $I = [0, \infty)$ or $I = (0, \infty)$.

Table 7 summarizes assumptions which some of our subsequent results impose on scoring
functions. The nonnegativity condition (S0) is standard and not restrictive. Indeed, if $S_0$ is
such that $S_0(x, y) \ge S_0(y, y)$ for all $x, y \in I$, which is a natural assumption on a loss or scoring
function, then $S(x, y) = S_0(x, y) - S_0(y, y)$ satisfies (S0) and shares the optimal point forecast
(6), subject to integrability conditions that are not of practical concern. Generally, a loss
function can be multiplied by a strictly positive constant and any function that depends on $y$
only can be added, without changing the nature of the optimal point forecast. Furthermore,
the optimization problem in (6) is posed in terms of the point predictor, $x$. In this light, it is
natural that assumptions (S1) and (S2) concern continuity and differentiability with respect
to the first argument, the point forecast $x$.

Table 7: Assumptions on a scoring function $S$ on a PO domain $\mathcal{D} = I \times I$, where $I \subseteq \mathbb{R}$ is an
interval, $x \in I$ denotes the point forecast and $y \in I$ the realizing observation.

(S0) $S(x, y) \ge 0$ with equality if $x = y$

(S1) $S(x, y)$ is continuous in $x$

(S2) The partial derivative $\partial S(x, y) / \partial x$ exists and is continuous in $x$ whenever $x \ne y$

Efron (1991) and Patton (2010) argue that homogeneity or scale invariance is a desirable
property of a scoring function. We adopt this notion and call a scoring function $S$ on the
PO domain $\mathcal{D} = D \times D$ homogeneous of order $b$ if

$$S(cx, cy) = |c|^b \, S(x, y) \quad \text{for all } x, y \in D \text{ and } c \in \mathbb{R}$$

which are such that $cx \in D$ and $cy \in D$. Evidently, the underlying quest is that for
equivariance in the decision problem. The scoring function $S$ on the PO domain $\mathcal{D} = D \times D$
is equivariant with respect to some class $\mathcal{H}$ of injections $h : D \to D$ if

$$\arg\min_x E_F[S(x, h(Y))] = h\big(\arg\min_x E_F[S(x, Y)]\big)$$

for all $h \in \mathcal{H}$ and all probability distributions $F$ that are concentrated on $D$. For instance,
if $S$ is homogeneous on $D = \mathbb{R}^d$ or $D = (0, \infty)^d$ then it is equivariant with respect to the
multiplicative group of the linear transformations $\{x \mapsto cx : c > 0\}$. If the scoring function is
of the prediction error form on $D = \mathbb{R}^d$, then it is equivariant with respect to the translation
group $\{x \mapsto x + b : b \in \mathbb{R}^d\}$.
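Homogeneity is easy to probe numerically. The sketch below recovers the order $b$ from a single pair of evaluations and checks scale equivariance of the empirical Bayes rule under the absolute error on a small made-up sample. It assumes the usual definitions of the four scoring functions, and the helper names and test values are hypothetical.

```python
import math

# Usual definitions of the four scoring functions (x = forecast, y = observation).
SCORES = {
    "SE": lambda x, y: (x - y) ** 2,
    "AE": lambda x, y: abs(x - y),
    "APE": lambda x, y: abs((x - y) / y),
    "RE": lambda x, y: abs((x - y) / x),
}


def homogeneity_order(score, x, y, c):
    """Solve score(c x, c y) = |c|**b * score(x, y) for b at one point.

    Requires |c| != 1 and score(x, y) > 0; meaningful only if the score is homogeneous.
    """
    return math.log(score(c * x, c * y) / score(x, y)) / math.log(abs(c))


def empirical_bayes_rule(score, observations):
    """Minimize the mean score over the observed values only.

    Restricting the search to observed values is adequate for the absolute error,
    whose empirical minimizer is a sample median.
    """
    return min(observations, key=lambda x: sum(score(x, y) for y in observations))


orders = {name: homogeneity_order(s, 2.0, 5.0, 3.0) for name, s in SCORES.items()}

observations = [0.5, 1.0, 1.5, 4.0, 9.0]      # made-up sample
scaled = [3.0 * y for y in observations]      # the map y -> 3 y
forecast = empirical_bayes_rule(SCORES["AE"], observations)
forecast_scaled = empirical_bayes_rule(SCORES["AE"], scaled)
```

The recovered orders are 2 for the squared error, 1 for the absolute error and 0 for the absolute percentage error and the relative error, and the optimal forecast for the rescaled sample is three times the original one, as equivariance requires.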

While our decision-theoretic setting resembles and follows those of Osband (1985) and Lambert
et al. (2008), and the subsequent development owes much to their pioneering works,
there are distinctions in technique. For example, Osband (1985) assumes a bounded domain 
D, while Lambert et al. (2008) consider D to be a finite set. The work of Granger and 
Pesaran (2000a, 2000b), which argues in favor of closer links between decision theory and 
forecast evaluation, focuses on probability forecasts for a dichotomous event. 

2.2 Consistency 

In the decision-theoretic framework, we think of the aforementioned 'distributional feature'
or 'directive' for the forecaster as a statistical functional. Formally, a statistical functional,
or simply a functional, is a potentially set-valued mapping from a class of probability distributions,
$\mathcal{F}$, to a Euclidean space (Horowitz and Manski 2006; Huber and Ronchetti 2009;
Wellner 2009). In the current context of point forecasting, we require that the functional

$$T : \mathcal{F} \to D, \qquad F \mapsto T(F),$$

maps into the domain $D \subseteq \mathbb{R}^d$. Frequently, we take $\mathcal{F}$ to be the class of all probability
measures on $D$, or the class of the probability measures with compact support in $D$.

To facilitate the presentation, the following definitions and results suppress the dependence
of the scoring function $S$, the functional $T$ and the class $\mathcal{F}$ on the domain $D$.

Definition 2.1. The scoring function $S$ is consistent for the functional $T$ relative to the
class $\mathcal{F}$ if

$$E_F \, S(t, Y) \le E_F \, S(x, Y) \tag{7}$$

for all probability distributions $F \in \mathcal{F}$, all $t \in T(F)$ and all $x \in D$. It is strictly consistent
if it is consistent and equality in (7) implies that $x \in T(F)$.

As noted, the term consistent was coined by Murphy and Daan (1985, p. 391), who stressed
that it is critically important to define consistency for a fixed, given functional, as opposed to
a generic notion of consistency, which was, correctly, refuted by Jolliffe (2008). For example,
the squared error scoring function, $S(x, y) = (x - y)^2$, is consistent, but not strictly consistent,
for the mean functional relative to the class of the probability measures on the real line with
finite first moment. It is strictly consistent relative to the class of the probability measures
with finite second moment.
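As a brief illustration of the squared error example, suppose that $F$ has mean $\mu$ and finite variance, and write $x - Y = (x - \mu) - (Y - \mu)$. Expanding the square gives the following decomposition, in which the cross term vanishes because $E_F (Y - \mu) = 0$.

```latex
E_F (x - Y)^2
  = (x - \mu)^2 - 2 (x - \mu) \, E_F (Y - \mu) + E_F (Y - \mu)^2
  = \mathrm{Var}_F(Y) + (x - \mu)^2 .
```

The right-hand side is minimized if and only if $x = \mu$, which is strict consistency. If the second moment of $F$ is infinite, the expected score is infinite for every $x$, so that (7) holds with equality throughout and strictness is lost.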

In a parametric context, Lehmann (1951) and Noorbaloochi and Meeden (1983) refer to a related
property as decision-theoretic unbiasedness. The following result notes that consistency
is the dual of the optimal point forecast property, just as decision-theoretic unbiasedness is
the dual of being Bayes (Noorbaloochi and Meeden 1983). It thus connects the problems of
finding optimal point forecasts, and of evaluating point predictions.

Theorem 2.2. The scoring function $S$ is consistent for the functional $T$ relative to the class
$\mathcal{F}$ if and only if, given any $F \in \mathcal{F}$, any $x \in T(F)$ is an optimal point forecast under $S$.

Stated differently, the class of the scoring functions that are consistent for a certain functional 
is identical to the class of the loss functions under which the functional is an optimal point 
forecast. Despite its simplicity, and the proof being immediate from the defining properties, 
this duality does not appear to be widely appreciated. 

Our next result shows that the class of the consistent scoring functions is convex, and thus
suggests the existence of Choquet representations (Phelps 1966).

Theorem 2.3. Let $\lambda$ be a measure on a measurable space $(\Omega, \mathcal{A})$. Suppose that for all
$\omega \in \Omega$, the scoring function $S_\omega$ satisfies (S0) and is consistent for the functional $T$ relative
to the class $\mathcal{F}$. Then the scoring function

$$S(x, y) = \int S_\omega(x, y) \, \lambda(d\omega)$$

is consistent for $T$ relative to $\mathcal{F}$.

At this point, it will be useful to distinguish the notions of a proper scoring rule (Winkler 
1996; Gneiting and Raftery 2007) and a consistent scoring function. I believe that this 
distinction is useful, even though the extant literature has failed to make it. For example, in 
referring to proper scoring rules for quantile forecasts, Cervera and Munoz (1996), Gneiting 
and Raftery (2007), Hilden (2008) and Jose and Winkler (2009) discuss scoring functions 
that are consistent for a quantile. 

Within our decision-theoretic framework, a proper scoring rule is a function $S : \mathcal{F} \times D \to \mathbb{R}$
such that

$$E_F \, S(F, Y) \le E_F \, S(G, Y) \tag{8}$$

for all probability distributions $F, G \in \mathcal{F}$, where we assume that the expectations are well-defined.
Note that $S$ is defined on the Cartesian product of the class $\mathcal{F}$ and the domain
$D$. The loss or penalty $S(F, y)$ arises when a probabilistic forecaster issues the predictive
distribution $F$ while $y \in D$ materializes. The expectation inequality (8) then implies that
the forecaster minimizes the expected loss by following her true beliefs. Thus, the use of
proper scoring rules encourages sincerity and candor among probabilistic forecasters.

In contrast, a scoring function $S$ acts on the PO domain, $\mathcal{D} = D \times D$, that is, the Cartesian
product of $D$ with itself. This is a much simpler domain than that for a scoring rule. However,
any consistent scoring function induces a proper scoring rule in a straightforward and natural 
construction, as follows. 

Theorem 2.4. Suppose that the scoring function $S$ is consistent for the functional $T$ relative
to the class $\mathcal{F}$. Then the function

$$S : \mathcal{F} \times D \to [0, \infty), \qquad (F, y) \mapsto S(F, y) = S(T(F), y),$$

is a proper scoring rule.

A more general decision-theoretic approach to the construction of proper scoring rules is 
described by Dawid (2007, p. 78) and Gneiting and Raftery (2007, p. 361). 

2.3 Elicitability 

We turn to the notion of elicitability, which is a critically important concept in the evaluation
of point forecasts. While the general notion dates back to the pioneering work of Osband
(1985), the term elicitable was coined only recently by Lambert et al. (2008). Whenever
appropriate and feasible, we suppress the dependence of the definitions and results on the
PO domain $\mathcal{D} = D \times D$.

Definition 2.5. The functional $T$ is elicitable relative to the class $\mathcal{F}$ if there exists a scoring
function $S$ that is strictly consistent for $T$ relative to $\mathcal{F}$.

Evidently, if $T$ is elicitable relative to the class $\mathcal{F}$, then it is elicitable relative to any subclass
$\mathcal{F}_0 \subseteq \mathcal{F}$. The following result then is a version of Osband's (1985, p. 9) revelation principle.

Theorem 2.6 (Osband). Suppose that the class $\mathcal{F}$ is concentrated on the domain $D$, and
let $g : D \to D$ be a one-to-one mapping. Then the following holds.

(a) If $T$ is elicitable, then $T_g = g \circ T$ is elicitable.

(b) If $S$ is consistent for $T$, then the scoring function

$$S_g(x, y) = S(g^{-1}(x), y)$$

is consistent for $T_g$.

(c) If $S$ is strictly consistent for $T$, then $S_g$ is strictly consistent for $T_g$.
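
A short worked example may help to fix ideas. It rests on assumptions that are not part of the text: the domain $D = (0, \infty)$, the mean functional $T$ with the squared error scoring function $S$, distributions with finite second moment, and the one-to-one mapping $g(x) = x^2$. Under these assumptions, $T_g(F) = (\mathbb{E}_F Y)^2$, and the transformed scoring function and its expectation are as follows.

```latex
\begin{align*}
S_g(x, y) &= S\bigl(g^{-1}(x), y\bigr) = \bigl(\sqrt{x} - y\bigr)^2, \\
\mathbb{E}_F\, S_g(x, Y) &= \operatorname{Var}_F(Y) + \bigl(\sqrt{x} - \mathbb{E}_F Y\bigr)^2 .
\end{align*}
```

The expected score is uniquely minimized at $x = (\mathbb{E}_F Y)^2 = T_g(F)$, in agreement with parts (b) and (c) of the theorem.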

The next theorem is an original result that concerns weighted scoring functions, where the 
weight function depends on the realizing observation, y, only. 

Theorem 2.7. Let the functional $T$ be defined on a class $\mathcal{F}$ of probability distributions which
admit a density, $f$, with respect to some dominating measure on the domain $D$. Consider
the weight function

$$w : D \to [0, \infty).$$

Let $\mathcal{F}^{(w)} \subseteq \mathcal{F}$ denote the subclass of the probability distributions in $\mathcal{F}$ which are such that
$w(y) f(y)$ has finite integral over $D$, and the probability measure $F^{(w)}$ with density
proportional to $w(y) f(y)$ belongs to $\mathcal{F}$. Define the functional

$$T^{(w)} : \mathcal{F}^{(w)} \to D, \qquad F \mapsto T^{(w)}(F) = T(F^{(w)}), \tag{9}$$

on this subclass $\mathcal{F}^{(w)}$. Then the following holds.

(a) If $T$ is elicitable, then $T^{(w)}$ is elicitable.

(b) If $S$ is consistent for $T$ relative to $\mathcal{F}$, then

$$S^{(w)}(x, y) = w(y)\, S(x, y) \tag{10}$$

is consistent for $T^{(w)}$ relative to $\mathcal{F}^{(w)}$.

Table 8: The optimal point forecast or Bayes rule when the scoring function is relative
error, $S(x, y) = |(x - y)/x|$, and the future quantity $Y$ can be represented as $Y = Z^2$, where
$Z$ has a t-distribution with mean 0, variance 1 and $\nu > 2$ degrees of freedom. In the limiting
case as $\nu \to \infty$, we take $Z$ to be standard normal. If $Z$ has variance $\sigma^2$ the entries need to
be multiplied by this factor. As opposed to the approximations in Table 1 of Patton (2010),
which stem from numerical and Monte Carlo methods and are reproduced below, our results
derive from Theorem 2.7 and are exact. For details see Appendix B.

                               ν = 4    ν = 6    ν = 8    ν = 10   ν → ∞
Exact optimal point forecast   3.4048   2.8216   2.6573   2.5801   2.3660
Patton's approximation         3.0962   2.7300   2.6067   2.5500   2.3600

(c) If $S$ is strictly consistent for $T$ relative to $\mathcal{F}$, then $S^{(w)}$ is strictly consistent for $T^{(w)}$
relative to $\mathcal{F}^{(w)}$.

In other words, a weighted scoring function is consistent for the functional $T^{(w)}$, which acts
on the predictive distribution in a peculiar way, in that it applies the original functional, 
T, to the probability measure whose density is proportional to the product of the weight 
function and the original density. 

Theorem 2.7 is a very general result with a wealth of applications, both in forecast evaluation
and in the derivation of optimal point forecasts. In particular, the functional (9) is the
optimal point forecast under the weighted scoring function (10), which allows us to unify
and extend scattered prior results. For example, the scoring function of equation (3),

$$S_\beta(x, y) = \left| 1 - \left( \frac{y}{x} \right)^{\beta} \right|,$$

is of the form (10) with the original scoring function $S(x, y) = |x^{-\beta} - y^{-\beta}|$ and the weight
function $w(y) = y^\beta$ on the positive halfaxis, $D = (0, \infty)$. The scoring function $S$ is consistent
for the median functional. Thus, as noted in the introduction, the scoring function $S_\beta$
is consistent for the $\beta$-median functional, $\mathrm{med}^{(\beta)}(F)$, that is, the median of a probability
distribution whose density is proportional to $y^\beta f(y)$, where $f$ is the density of $F$. If $\beta = -1$,
we recover the absolute percentage error, $S_{-1}(x, y) = |(x - y)/y|$. The case $\beta = 1$ corresponds
to the relative error, $S_1(x, y) = |(x - y)/x|$, which Patton (2010) refers to as the MAE-prop
function. Table 1 of Patton (2010) shows Monte Carlo based approximate values for optimal
point forecasts under this scoring function. Theorem 2.7 permits us to give exact results;
these are summarized in Table 8 and differ notably from the approximations.
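
The limiting entry of Table 8 can be checked with a short computation. The sketch below assumes that Z is standard normal, so that Y = Z^2 has a chi-square distribution with one degree of freedom; the density proportional to y f(y) is then the chi-square density with three degrees of freedom, and the optimal point forecast under the relative error is its median. This identification and the helper names are our own illustration and are not taken from the paper's Appendix B.

```python
import math


def chi2_3_cdf(y):
    """CDF of the chi-square distribution with 3 degrees of freedom (closed form)."""
    if y <= 0.0:
        return 0.0
    return math.erf(math.sqrt(y / 2.0)) - math.sqrt(2.0 * y / math.pi) * math.exp(-y / 2.0)


def bisect_root(func, lo, hi, tol=1e-12):
    """Root of an increasing function on [lo, hi], found by bisection."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if func(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# beta-median with beta = 1: the median of the density proportional to y * f(y).
beta_median = bisect_root(lambda y: chi2_3_cdf(y) - 0.5, 0.0, 20.0)
print(round(beta_median, 4))
```

The result agrees with the exact limiting value reported in Table 8 to the printed precision.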

Another interesting case arises when the original scoring function $S$ is the squared error,
$S(x, y) = (x - y)^2$, which is consistent for the mean or expectation functional. If $T$ is the
mean functional, the functional of equation (9) becomes

$$T^{(w)}(F) = T(F^{(w)}) = \frac{\mathbb{E}_F[Y w(Y)]}{\mathbb{E}_F[w(Y)]}. \tag{11}$$

Park and Stefanski (1998) studied optimal point forecasts in the special case in which $D =
(0, \infty)$ is the positive half-axis and $w(y) = 1/y^2$, so that $S^{(w)}(x, y) = (x - y)^2 / y^2$ is the
squared percentage error. By equation (11), the scoring function is consistent for the
functional $T^{(w)}(F) = \mathbb{E}_F[Y^{-1}] / \mathbb{E}_F[Y^{-2}]$. By the duality between consistent scoring functions
and Bayes rules, this latter quantity is the optimal
point forecast under the squared percentage error scoring function, which is the result derived
by Park and Stefanski (1998).
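
The Park and Stefanski result is easy to verify on a toy example. The sketch below uses a hypothetical three-point predictive distribution and exact rational arithmetic; the distribution, the grid and the function names are assumptions made for illustration. It compares the closed form, the ratio of the expectations of 1/Y and 1/Y^2, with a brute-force search over candidate point forecasts.

```python
from fractions import Fraction

# Hypothetical three-point predictive distribution on the positive half-axis.
values = [Fraction(1), Fraction(2), Fraction(4)]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]


def expectation(func):
    """Expectation of func(Y) under the toy distribution."""
    return sum(p * func(y) for p, y in zip(probs, values))


def expected_spe(x):
    """Expected squared percentage error E[(x - Y)^2 / Y^2] of the point forecast x."""
    return expectation(lambda y: (x - y) ** 2 / y ** 2)


# Closed form from equation (11) with w(y) = 1 / y^2.
optimal = expectation(lambda y: 1 / y) / expectation(lambda y: 1 / y ** 2)

# Brute-force check over a grid of candidate forecasts with spacing 1/100.
grid = [Fraction(k, 100) for k in range(1, 501)]
best_on_grid = min(grid, key=expected_spe)

print(optimal, float(optimal), float(best_on_grid), float(expectation(lambda y: y)))
```

In this example the optimal point forecast lies well below the mean of the distribution, which illustrates how strongly the squared percentage error favors low forecasts.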

Situations in which the weight function depends on the point forecast, $x$, need to be handled
on a case-by-case basis. For example, a routine calculation shows that the squared relative
error scoring function, $S(x, y) = (x - y)^2 / x^2$, is consistent for the functional

$$T(F) = \frac{\mathbb{E}_F[Y^2]}{\mathbb{E}_F[Y]}. \tag{12}$$

Incidentally, by a special case of (11) the observation-weighted scoring function $S(x, y) =
y (x - y)^2$ is also consistent for the functional (12). Later on in equation (23) we characterize
the class of the scoring functions that are consistent for this functional.
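
The routine calculation for the squared relative error can be spelled out as follows. This is a sketch in our own notation, assuming that $D = (0, \infty)$, that $F$ has a finite second moment, and using the substitution $u = 1/x$, which is introduced here for convenience.

```latex
\begin{align*}
\mathbb{E}_F\, \frac{(x - Y)^2}{x^2}
  &= \mathbb{E}_F\, (1 - uY)^2
   = 1 - 2u\, \mathbb{E}_F[Y] + u^2\, \mathbb{E}_F[Y^2], \qquad u = 1/x, \\
\frac{\mathrm{d}}{\mathrm{d}u}\, \mathbb{E}_F\, (1 - uY)^2
  &= -2\, \mathbb{E}_F[Y] + 2u\, \mathbb{E}_F[Y^2] = 0
  \quad \Longleftrightarrow \quad
  u = \frac{\mathbb{E}_F[Y]}{\mathbb{E}_F[Y^2]}
  \quad \Longleftrightarrow \quad
  x = \frac{\mathbb{E}_F[Y^2]}{\mathbb{E}_F[Y]} .
\end{align*}
```

The expected score is a convex quadratic in $u$, so the stationary point is the unique minimizer.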

While Theorems 2.6 and 2.7 suggest that general classes of functionals are elicitable, not all
functionals are such. The following result, which is a variant of Proposition 2.5 of Osband
(1985) and Lemma 1 of Lambert et al. (2008), states a necessary condition.

Theorem 2.8 (Osband). If a functional is elicitable then its level sets are convex in the
following sense: If $F_0 \in \mathcal{F}$, $F_1 \in \mathcal{F}$ and $p \in (0, 1)$ are such that $F_p = (1 - p) F_0 + p F_1 \in \mathcal{F}$,
then $t \in T(F_0)$ and $t \in T(F_1)$ imply $t \in T(F_p)$.

For example, the sum of two distinct quantiles generally does not have convex level sets and
thus is not an elicitable functional. Interesting open questions include those for a converse
of Theorem 2.8 and, more generally, for a characterization of elicitability.



2.4 Osband's principle 

Given an elicitable functional T, is there a practical way of describing and characterizing 
the class of the scoring functions that are consistent for it? The following general approach, 
which originates in the pioneering work of Osband (1985), is frequently useful. 

Suppose that the functional $T$ is defined for a class of probability measures on the domain
$D$ which includes the two-point distributions. Assume that there exists an identification
function $V : D \times D \to \mathbb{R}$ such that

$$\mathbb{E}_F[V(x, Y)] = 0 \iff x \in T(F) \tag{13}$$

Table 9: Possible choices for the identification function $V$ with the property (13) in the case
in which $D = I \subseteq \mathbb{R}$ is an interval.

Functional                       Identification function
Mean                             V(x, y) = x − y
Ratio E_F[r(Y)] / E_F[s(Y)]      V(x, y) = x s(y) − r(y)
α-Quantile                       V(x, y) = 1(x ≥ y) − α
τ-Expectile                      V(x, y) = 2 |1(x ≥ y) − τ| (x − y)

and $V(x, y) \ne 0$ unless $x = y$. If a consistent scoring function is available, which is smooth
in its first argument, we can take $V(x, y)$ to be the corresponding partial derivative. For
example, if $T$ is the mean or expectation functional on an interval $D = I \subseteq \mathbb{R}$, we can pick
$V(x, y) = x - y$, which derives from the squared error scoring function, $S(x, y) = (x - y)^2$.
Table 9 provides further examples, with the second and fourth nesting the first.

The function

$$\epsilon(c) = p\, S(c, a) + (1 - p)\, S(c, b) \tag{14}$$

represents the expected score when we issue the point forecast $c$ for a random vector $Y$ such
that $Y = a$ with probability $p$ and $Y = b$ with probability $1 - p$. Since $S$ is consistent for
the functional $T$, the identification function property (13) implies that $\epsilon(c)$ has a minimum
at $c = x$, where

$$p\, V(x, a) + (1 - p)\, V(x, b) = 0. \tag{15}$$

If $S$ is smooth in its first argument, we can combine (14) and (15) to result in

$$S_{(1)}(x, a) / V(x, a) = S_{(1)}(x, b) / V(x, b), \tag{16}$$

where $S_{(1)}$ denotes a partial derivative or gradient with respect to the first argument. If this
latter equality holds for all pairwise distinct $a$, $b$ and $x \in D$, the function $S_{(1)}(x, y) / V(x, y)$
is independent of $y \in D$, and we can write

$$S_{(1)}(x, y) = h(x)\, V(x, y) \tag{17}$$

for $x, y \in D$ and some function $h : D \to D$. Frequently, we can integrate (17) to obtain the
general form of a scoring function that is consistent for the functional $T$.

In recognition of Osband's (1985) fundamental yet unpublished work, we refer to this gen- 
eral approach as Osband's principle. The examples in the subsequent section give various 
instances in which the principle can be successfully put to work. For a general technical 
result, see Theorem 2.1 of Osband (1985). 
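
To see the principle at work, consider the mean functional with the identification function $V(x, y) = x - y$. The following sketch assumes that $h = \phi''$ for some twice differentiable convex function $\phi$ and that $S(y, y) = 0$; both assumptions are added here for illustration. Integrating (17) by parts then gives the following.

```latex
\begin{align*}
S_{(1)}(x, y) &= \phi''(x)\, (x - y), \\
S(x, y) &= \int_y^x \phi''(t)\, (t - y)\, \mathrm{d}t
         = \Bigl[ \phi'(t)\, (t - y) \Bigr]_{t = y}^{t = x} - \int_y^x \phi'(t)\, \mathrm{d}t \\
        &= \phi(y) - \phi(x) - \phi'(x)\, (y - x) .
\end{align*}
```

This is the Bregman form that appears in equation (18).
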

3 Examples



We now give examples in the case of a univariate predictand, in which any connected domain
$D = I \subseteq \mathbb{R}$ is an interval. Some of the results are classical, such as the characterizations
for expectations (Savage 1971) and quantiles (Thomson 1979), and some are novel, includ- 
ing those for ratios of expectations, expectiles and conditional value-at-risk. In a majority 
of the examples, the technical arguments rely on the properties of convex functions and 
subgradients, for which we refer to Rockafellar (1970). 

3.1 Expectations 

It is well known that the squared error scoring function, $S(x, y) = (x - y)^2$, is strictly
consistent for the mean functional relative to the class of the probability distributions on R 
whose second moment is finite. Thus, means or expectations are elicitable. Before turning 
to more general settings in subsequent sections, we review a classical result of Savage (1971) 
which identifies the class of the scoring functions that are consistent for the mean functional 
as that of the Bregman functions. Closely related results have been obtained by Reichelstein 
and Osband (1984), Saerens (2000), Banerjee, Guo and Wang (2005) and Patton (2010). 

Theorem 3.1 (Savage). Let $\mathcal{F}$ be the class of the probability measures on the interval $I \subseteq \mathbb{R}$
with finite first moment. Then the following holds.

(a) The mean functional is elicitable relative to the class $\mathcal{F}$.

(b) Suppose that the scoring function $S$ satisfies assumptions (S0), (S1) and (S2) on the
PO domain $\mathcal{D} = I \times I$. Then $S$ is consistent for the mean functional relative to the
class of the compactly supported probability measures on $I$ if, and only if, it is of the
form

$$S(x, y) = \phi(y) - \phi(x) - \phi'(x)\, (y - x), \tag{18}$$

where $\phi$ is a convex function with subgradient $\phi'$ on $I$.

(c) If $\phi$ is strictly convex, the scoring function (18) is strictly consistent for the mean
functional relative to the class of the probability measures $F$ on $I$ for which both $\mathbb{E}_F Y$
and $\mathbb{E}_F \phi(Y)$ exist and are finite.

Banerjee et al. (2005) refer to a function of the form (18) as a Bregman function. For
example, if $I = \mathbb{R}$ and $\phi(x) = |x|^a$, where $a > 1$ to ensure strict convexity, the Bregman
representation yields the scoring function

$$S_a(x, y) = |y|^a - |x|^a - a\, \mathrm{sign}(x)\, |x|^{a-1} (y - x), \tag{19}$$

which is homogeneous of order a and nests the squared error that arises when a = 2. Savage 
(1971) showed that up to a multiplicative constant squared error is the unique Bregman 

Figure 2: The mean score (1) under the Patton scoring function (20) for Mr. Bayes (green),
the optimist (orange) and the pessimist (red) in the simulation study of Section 1.2.



function of the prediction error form, as well as the unique symmetric Bregman function.
Patton (2010) introduced a rich and flexible family of homogeneous Bregman functions on
the PO domain $\mathcal{D} = (0, \infty) \times (0, \infty)$, namely

$$S_b(x, y) = \begin{cases}
\dfrac{1}{b(b-1)}\, (y^b - x^b) - \dfrac{1}{b-1}\, x^{b-1} (y - x) & \text{if } b \in \mathbb{R} \setminus \{0, 1\}, \\[1ex]
\dfrac{y}{x} - \log \dfrac{y}{x} - 1 & \text{if } b = 0, \\[1ex]
y \log \dfrac{y}{x} - y + x & \text{if } b = 1.
\end{cases} \tag{20}$$

Up to a multiplicative constant, these are the only homogeneous Bregman functions on
this PO domain. The squared error scoring function emerges when $b = 2$ and the QLIKE
function (Patton 2010) when $b = 0$. If $b = a > 1$ the Patton function (20) coincides with the
corresponding restriction of the power function (19), up to a multiplicative constant.
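
A small numerical experiment illustrates that every member of the Patton family points to the mean. The sketch below implements the family (20) and minimizes the mean score over a grid; the sample, the grid and the function names are hypothetical and serve only as an illustration.

```python
import math


def patton_score(x, y, b):
    """Patton family (20) on the PO domain (0, inf) x (0, inf)."""
    if b == 0:
        return y / x - math.log(y / x) - 1.0
    if b == 1:
        return y * math.log(y / x) - y + x
    return (y ** b - x ** b) / (b * (b - 1.0)) - x ** (b - 1.0) * (y - x) / (b - 1.0)


# Hypothetical positive outcomes standing in for the predictive distribution.
sample = [0.5, 1.0, 1.5, 2.0, 5.0]
sample_mean = sum(sample) / len(sample)


def mean_score(x, b):
    """Average Patton score of the point forecast x over the sample."""
    return sum(patton_score(x, y, b) for y in sample) / len(sample)


# Candidate point forecasts 0.10, 0.11, ..., 10.00.
grid = [k / 100.0 for k in range(10, 1001)]
minimizers = {b: min(grid, key=lambda x: mean_score(x, b)) for b in (-1.0, 0, 0.5, 1, 2.0, 3.0)}
print(sample_mean, minimizers)
```

For each choice of b considered, the grid minimizer coincides with the sample mean, as part (c) of Theorem 3.1 suggests for strictly convex $\phi$.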

Finally, it is worth noting that proper scoring rules for probability forecasts of a dichotomous
event are also of the Bregman form, because the probability of a binary event equals the 
expectation of the corresponding indicator variable. Compare McCarthy (1956), Savage 
(1971), DeGroot and Fienberg (1983), Schervish (1989), Winkler (1996), Buja, Stuetzle and 
Shen (2005) and Gneiting and Raftery (2007), among others. 

Figure 2 returns to the initial simulation study of Section 1.2 and shows the mean score
(1) under the Patton scoring function (20) for Mr. Bayes, the optimist and the pessimist.
The optimal point forecast under a Bregman scoring function is the mean of the predictive
distribution, so that the statistician forecaster fuses with Mr. Bayes.



3.2 Ratios of expectations 

We now consider statistical functionals which can be represented as ratios of expectations. 
The mean functional emerges in the special case in which r(y) = y and s(y) = 1. 

Theorem 3.2. Let $I \subseteq \mathbb{R}$ be an interval, and suppose that $r : I \to \mathbb{R}$ and $s : I \to (0, \infty)$ are
measurable functions. Then the following holds.

(a) The functional

$$T(F) = \frac{\mathbb{E}_F[r(Y)]}{\mathbb{E}_F[s(Y)]} \tag{21}$$

is elicitable relative to the class of the probability measures on $I$ for which $\mathbb{E}_F[r(Y)]$,
$\mathbb{E}_F[s(Y)]$ and $\mathbb{E}_F[Y s(Y)]$ exist and are finite.

(b) If $S$ is of the form

$$S(x, y) = s(y) (\phi(y) - \phi(x)) - \phi'(x) (r(y) - x s(y)) + \phi'(y) (r(y) - y s(y)), \tag{22}$$

where $\phi$ is a convex function with subgradient $\phi'$, then it is consistent for the
functional (21) relative to the class of the probability measures $F$ on $I$ for which $\mathbb{E}_F[r(Y)]$,
$\mathbb{E}_F[s(Y)]$, $\mathbb{E}_F[r(Y) \phi'(Y)]$, $\mathbb{E}_F[s(Y) \phi(Y)]$ and $\mathbb{E}_F[Y s(Y) \phi'(Y)]$ exist and are finite. If
$\phi$ is strictly convex, then $S$ is strictly consistent.

(c) Suppose that the scoring function $S$ satisfies assumptions (S0), (S1) and (S2) on the
PO domain $\mathcal{D} = I \times I$. If $s$ is continuous and $r(y) = y s(y)$ for $y \in I$, then $S$
is consistent for the functional (21) relative to the class of the compactly supported
probability measures on $I$ if, and only if, it is of the form (22), where $\phi$ is a convex
function with subgradient $\phi'$.

In the case in which $s(y) = w(y)$ and $r(y) = y w(y)$ for a strictly positive, continuous weight
function $w$, the ratio (21) coincides with the functional (11). If $I = (0, \infty)$ and $w(y) = y$,
the special case $T(F) = \mathbb{E}_F[Y^2] / \mathbb{E}_F[Y]$ of equation (12) arises. In Section 2.3 we saw that
both the squared relative error scoring function, $S(x, y) = (x - y)^2 / x^2$, and the observation-weighted
scoring function $S(x, y) = y (x - y)^2$ are consistent for this functional. By part (c)
of Theorem 3.2 the general form of a scoring function that is consistent for the functional
(12) is

$$S(x, y) = y\, (\phi(y) - \phi(x)) - y\, (y - x)\, \phi'(x), \tag{23}$$

where $\phi$ is convex with subgradient $\phi'$. The above scoring functions emerge when $\phi(y) = 1/y$
and $\phi(y) = y^2$, respectively.
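
The two special cases of the general form (23) can be confirmed with a few lines of code. The sketch below is illustrative only: the test points, the sample, the grid and the function names are assumptions. It checks the algebraic identities pointwise and then verifies on a toy sample that both scoring functions are minimized at the ratio of the second to the first moment.

```python
def score_23(x, y, phi, dphi):
    """General form (23): y (phi(y) - phi(x)) - y (y - x) phi'(x)."""
    return y * (phi(y) - phi(x)) - y * (y - x) * dphi(x)


def squared_relative_error(x, y):
    return (x - y) ** 2 / x ** 2


def observation_weighted(x, y):
    return y * (x - y) ** 2


def from_reciprocal(x, y):
    """Form (23) with phi(y) = 1 / y."""
    return score_23(x, y, lambda t: 1.0 / t, lambda t: -1.0 / t ** 2)


def from_square(x, y):
    """Form (23) with phi(y) = y^2."""
    return score_23(x, y, lambda t: t ** 2, lambda t: 2.0 * t)


def mean_score(score, x, sample):
    return sum(score(x, y) for y in sample) / len(sample)


# Hypothetical sample; the target is the empirical version of the functional (12).
sample = [1.0, 2.0, 3.0]
target = sum(y ** 2 for y in sample) / sum(sample)

grid = [k / 1000.0 for k in range(1000, 4001)]
best_reciprocal = min(grid, key=lambda x: mean_score(from_reciprocal, x, sample))
best_square = min(grid, key=lambda x: mean_score(from_square, x, sample))
print(target, best_reciprocal, best_square)
```

Both grid minimizers agree with the target up to the grid spacing.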






3.3 Quantiles and expectiles 



An $\alpha$-quantile ($0 < \alpha < 1$) of the cumulative distribution function $F$ is any number $x$ for
which $\lim_{y \uparrow x} F(y) \le \alpha \le F(x)$. In finance, quantiles are often referred to as value-at-risk
(VaR; Duffie and Pan 1997). The literature on the evaluation of quantile forecasts generally
recommends the use of the asymmetric piecewise linear scoring function,

$$S_\alpha(x, y) = (\mathbb{1}(x \ge y) - \alpha)\, (x - y), \tag{24}$$

which is strictly consistent for the $\alpha$-quantile relative to the class of the probability measures
with finite first moment (Raiffa and Schlaifer 1961, p. 196; Ferguson 1967, p. 51). This
well-known property lies at the heart of quantile regression (Koenker and Bassett 1978).

As regards the characterization of the scoring functions that are consistent for a quantile,
results of Thomson (1979) and Saerens (2000) can be summarized as follows. For a discussion
of their equivalence and historical comments, see Gneiting (2010).

Theorem 3.3 (Thomson, Saerens). Let $\mathcal{F}$ be the class of the probability measures on the
interval $I \subseteq \mathbb{R}$, and let $\alpha \in (0, 1)$. Then the following holds.

(a) The $\alpha$-quantile functional is elicitable relative to the class $\mathcal{F}$.

(b) Suppose that the scoring function $S$ satisfies assumptions (S0), (S1) and (S2) on the
PO domain $\mathcal{D} = I \times I$. Then $S$ is consistent for the $\alpha$-quantile relative to the class of
the compactly supported probability measures on $I$ if, and only if, it is of the form

$$S(x, y) = (\mathbb{1}(x \ge y) - \alpha)\, (g(x) - g(y)), \tag{25}$$

where $g$ is a nondecreasing function on $I$.

(c) If $g$ is strictly increasing, the scoring function (25) is strictly consistent for the
$\alpha$-quantile relative to the class of the probability measures $F$ on $I$ for which $\mathbb{E}_F\, g(Y)$
exists and is finite.

Gneiting (2008b) refers to a function of the form (25) as generalized piecewise linear (GPL)
of order $\alpha \in (0, 1)$, because it is piecewise linear after applying a nondecreasing
transformation. Any GPL function is equivariant with respect to the class of the nondecreasing
transformations, just as the quantile functional is equivariant under monotone mappings
(Koenker 2005, p. 39). If $I = (0, \infty)$ and $g(x) = x^b / |b|$ for $b \in \mathbb{R} \setminus \{0\}$, and taking the
corresponding limit as $b \to 0$, we obtain the family

$$S_{\alpha, b}(x, y) = \begin{cases}
(\mathbb{1}(x \ge y) - \alpha)\, \dfrac{1}{|b|}\, (x^b - y^b) & \text{if } b \in \mathbb{R} \setminus \{0\}, \\[1ex]
(\mathbb{1}(x \ge y) - \alpha)\, \log \dfrac{x}{y} & \text{if } b = 0,
\end{cases} \tag{26}$$


Figure 3: The mean score (1) under the GPL power scoring function (26) with $\alpha = 1/2$ for
Mr. Bayes (green), the statistician (blue), the optimist (orange) and the pessimist (red) in
the simulation study of Section 1.2.

of the GPL power scoring functions, which are homogeneous of order $b$. The asymmetric
piecewise linear function (24) arises when $b = 1$, and the MAE-LOG and MAE-SD functions
described by Patton (2009) emerge when $\alpha = 1/2$, and $b = 0$ and $b = 1/2$, respectively.

Figure 3 returns to the simulation study in Section 1.2 and shows the mean score (1) under
the GPL power function (26), where $\alpha = 1/2$, for Mr. Bayes, the statistician, the optimist and
the pessimist. Once again, Mr. Bayes dominates his competitors.
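
The equivariance of the GPL class can be seen in a small experiment. The sketch below evaluates the GPL score (25) for three strictly increasing transformations g on a hypothetical sample with distinct positive values; the sample, the level α = 0.25 and the function names are assumptions made for illustration. In each case the mean score is minimized at the same empirical quantile.

```python
import math


def gpl_score(x, y, alpha, g):
    """Generalized piecewise linear score (25) with a nondecreasing function g."""
    indicator = 1.0 if x >= y else 0.0
    return (indicator - alpha) * (g(x) - g(y))


# Hypothetical sample of distinct positive outcomes.
sample = [0.3, 0.8, 1.1, 1.9, 2.4, 3.7, 5.2, 6.0, 8.5]
alpha = 0.25


def mean_score(x, g):
    """Average GPL score of the point forecast x over the sample."""
    return sum(gpl_score(x, y, alpha, g) for y in sample) / len(sample)


transforms = {
    "linear (b = 1)": lambda t: t,
    "log (b = 0)": math.log,
    "square root (b = 1/2)": lambda t: math.sqrt(t) / 0.5,
}

# An empirical quantile is attained at a sample point, so we search over the sample.
minimizers = {name: min(sample, key=lambda x: mean_score(x, g)) for name, g in transforms.items()}
print(minimizers)
```

With nine observations, the only number x for which the empirical distribution function satisfies the quantile condition at level 0.25 is the third smallest observation, and all three scoring functions recover it.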

Newey and Powell (1987) introduced the $\tau$-expectile functional ($0 < \tau < 1$) of a probability
measure $F$ with finite mean as the unique solution $x = \mu_\tau$ to the equation

$$\tau \int_x^\infty (y - x)\, \mathrm{d}F(y) = (1 - \tau) \int_{-\infty}^x (x - y)\, \mathrm{d}F(y).$$

If the second moment of $F$ is finite, the $\tau$-expectile equals the Bayes rule or optimal point
forecast (6) under the asymmetric piecewise quadratic scoring function,

$$S_\tau(x, y) = |\mathbb{1}(x \ge y) - \tau|\, (x - y)^2, \tag{27}$$

similarly to the $\alpha$-quantile being the Bayes rule under the asymmetric piecewise linear
function (24). Not surprisingly, expectiles have properties that resemble those of quantiles.
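
The defining equation of the expectile lends itself to a simple numerical solution. The sketch below is an illustration under our own assumptions: an empirical distribution on a hypothetical sample, a bisection solver, and function names chosen for readability. It solves the Newey and Powell equation and then checks that the solution also minimizes the mean asymmetric piecewise quadratic score (27).

```python
def expectile(sample, tau, tol=1e-12):
    """Solve tau * E[(Y - x)+] = (1 - tau) * E[(x - Y)+] for x by bisection."""

    def imbalance(x):
        upper = sum(max(y - x, 0.0) for y in sample)
        lower = sum(max(x - y, 0.0) for y in sample)
        return (1.0 - tau) * lower - tau * upper  # nondecreasing in x

    lo, hi = min(sample), max(sample)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if imbalance(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def mean_asymmetric_quadratic(x, sample, tau):
    """Mean of the asymmetric piecewise quadratic score (27) over the sample."""
    total = 0.0
    for y in sample:
        indicator = 1.0 if x >= y else 0.0
        total += abs(indicator - tau) * (x - y) ** 2
    return total / len(sample)


sample = [-2.0, 0.0, 1.0, 3.0, 8.0]
mu_half = expectile(sample, 0.5)  # the 1/2-expectile is the mean
mu_high = expectile(sample, 0.8)
print(mu_half, mu_high)
```

For this sample the defining equation with τ = 0.8 can also be solved by hand on the interval between 3 and 8, which gives 4.25.
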

The following original result characterizes the class of the scoring functions that are consistent
for expectiles. It is interesting to observe the ways in which the corresponding class (28)
combines key characteristics of the Bregman and GPL families.

Theorem 3.4. Let $\mathcal{F}$ be the class of the probability measures on the interval $I \subseteq \mathbb{R}$ with
finite first moment, and let $\tau \in (0, 1)$. Then the following holds.

(a) The $\tau$-expectile functional is elicitable relative to the class $\mathcal{F}$.

(b) Suppose that the scoring function $S$ satisfies assumptions (S0), (S1) and (S2) on the
PO domain $\mathcal{D} = I \times I$. Then $S$ is consistent for the $\tau$-expectile relative to the class of
the compactly supported probability measures on $I$ if, and only if, it is of the form

$$S(x, y) = |\mathbb{1}(x \ge y) - \tau|\, (\phi(y) - \phi(x) - \phi'(x) (y - x)), \tag{28}$$

where $\phi$ is a convex function with subgradient $\phi'$ on $I$.

(c) If $\phi$ is strictly convex, the scoring function (28) is strictly consistent for the $\tau$-expectile
relative to the class of the probability measures $F$ on $I$ for which both $\mathbb{E}_F Y$ and $\mathbb{E}_F \phi(Y)$
exist and are finite.

3.4 Conditional value-at-risk

The $\alpha$-conditional value-at-risk functional ($\mathrm{CVaR}_\alpha$, $0 < \alpha < 1$) equals the expectation of a
random variable with distribution $F$ conditional on it taking values in its upper $(1 - \alpha)$-tail
(Rockafellar and Uryasev 2000, 2002). An often convenient, equivalent definition is

$$\mathrm{CVaR}_\alpha(F) = \frac{1}{1 - \alpha} \int_\alpha^1 q_\beta(F)\, \mathrm{d}\beta, \tag{29}$$

where $q_\beta$ denotes the $\beta$-quantile (Acerbi 2002), similarly to the functional representation
of the $\alpha$-trimmed mean (Huber and Ronchetti 2009). The CVaR functional is a popular
risk measure in quantitative finance. Its varied, elegant and appealing properties include
coherency in the sense of Artzner et al. (1999), who consider functionals defined in terms of
random variables, rather than the corresponding probability measures.

Theorem 3.5. The $\mathrm{CVaR}_\alpha$ functional is not elicitable relative to any class $\mathcal{F}$ of probability
distributions on the interval $I \subseteq \mathbb{R}$ that contains the measures with finite support, or the
finite mixtures of the absolutely continuous distributions with compact support.

This negative result challenges the use of the CVaR functional as a predictive measure of risk,
and may provide a partial explanation for the striking lack of literature on the evaluation of
CVaR forecasts, as opposed to quantile or VaR forecasts, for which we refer to Berkowitz and
O'Brien (2002), Giacomini and Komunjer (2005) and Bao, Lee and Saltoglu (2006), among
others. With consistent scoring functions not being available, it remains unclear how one
might assess and compare CVaR forecasts.

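
The failure of elicitability for CVaR can be made concrete through the convex level set condition of Theorem 2.8. The sketch below is our own illustration and is not claimed to reproduce the paper's proof: it builds two hypothetical finitely supported distributions with the same CVaR at level α = 1/2 and shows that their equal mixture has a different CVaR. Exact rational arithmetic is used so that the comparison is not affected by rounding.

```python
from fractions import Fraction


def cvar(pmf, alpha):
    """Mean of the upper (1 - alpha)-tail of a finitely supported distribution.

    pmf maps outcomes to probabilities. An atom that straddles the tail
    boundary is split, in line with the quantile-integral definition.
    """
    remaining = 1 - alpha
    total = Fraction(0)
    for value in sorted(pmf, reverse=True):
        take = min(pmf[value], remaining)
        total += take * value
        remaining -= take
        if remaining == 0:
            break
    return total / (1 - alpha)


def mixture(pmf0, pmf1, p):
    """Mixture distribution (1 - p) * pmf0 + p * pmf1."""
    out = {}
    for pmf, weight in ((pmf0, 1 - p), (pmf1, p)):
        for value, prob in pmf.items():
            out[value] = out.get(value, Fraction(0)) + weight * prob
    return out


alpha = Fraction(1, 2)
F0 = {Fraction(1): Fraction(1)}
F1 = {Fraction(0): Fraction(1, 2), Fraction(1, 2): Fraction(1, 4), Fraction(3, 2): Fraction(1, 4)}
Fp = mixture(F0, F1, Fraction(1, 2))

print(cvar(F0, alpha), cvar(F1, alpha), cvar(Fp, alpha))
```

Both components have CVaR equal to 1, whereas the mixture has CVaR equal to 9/8, so the level set at t = 1 is not convex.
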

3.5 Mode

Let $\mathcal{F}$ be a class of probability measures on the real line, each of which has a well-defined,
unique mode. It is sometimes stated informally that the mode is an optimal point forecast
under the zero-one scoring function,

$$S_c(x, y) = \mathbb{1}(|x - y| > c),$$

where $c > 0$. A rigorous statement is that the optimal point forecast or Bayes rule (6) under
the scoring function $S_c$ is the midpoint

$$\hat{x} = \arg\max_x \left( F(x + c) - \lim_{y \uparrow x - c} F(y) \right)$$

of the modal interval of length $2c$ of the probability measure $F \in \mathcal{F}$ (Ferguson 1967, p. 51).
Example 7.20 of Lehmann and Casella (1998) explores this argument in more detail.

Expressed differently, the zero-one scoring function $S_c$ is consistent for the midpoint
functional, which we denote by $T_c$. If $c$ is sufficiently small, then $T_c(F)$ is well-defined and
single-valued for all $F \in \mathcal{F}$. We can then define the mode functional on $\mathcal{F}$ as the limit

$$T_0(F) = \lim_{c \downarrow 0} T_c(F).$$

I do not know whether or not $T_0$ is elicitable. However, if the members of the class $\mathcal{F}$ have
continuous Lebesgue densities, then $T_0$ is asymptotically elicitable, in the sense that it can
be represented as the continuous limit of a family of elicitable functionals.

Stronger results become available if one puts conditions on both the scoring function $S$ and
the family $\mathcal{F}$ of probability distributions. Theorem 2 of Granger (1969) is a result of this
type. Consider the PO domain $\mathcal{D} = \mathbb{R} \times \mathbb{R}$. If the scoring function $S$ is an even function of the
prediction error that attains a minimum at the origin, and each $F \in \mathcal{F}$ admits a Lebesgue
density, $f$, which is symmetric, continuous and unimodal, so that mean, median and mode
coincide, then $S$ is consistent for this common functional. Theorem 1 of Granger (1969)
and Theorem 7.15 of Lehmann and Casella (1998) trade the continuity and unimodality
conditions on $f$ for an additional assumption of convexity on the scoring function.

Henderson, Jones and Stare (2001, p. 3087) posit that in survival analysis a loss function of
the form

$$S^*_k(x, y) = \begin{cases} 0 & \text{if } x/k \le y \le kx, \\ 1 & \text{otherwise}, \end{cases}$$

is reasonable, with a choice of $k = 2$ often being adequate, arguing that "most people for
example would accept that a lifetime prediction of, say, 2 months, was reasonably accurate if
death occurs between about 1 and 4 months". From the above, the optimal point forecast or
Bayes rule under $S^*_k$ is the midpoint functional $T_{\log k}$ applied to the predictive distribution
of the logarithm of the lifetime, rather than the lifetime itself. Henderson et al. (2001) give
various examples.


4 Multivariate predictands



While thus far we have restricted attention to point forecasts of a univariate quantity, the
general case of a multivariate predictand that takes values in a domain $D \subseteq \mathbb{R}^d$ is of
considerable interest. Applications include those of Gneiting et al. (2008) and Hering and Genton
(2010) to predictions of wind vectors, or that of Laurent, Rombouts and Violante (2009)
to forecasts of multivariate volatility, to name but a few. We turn to the decision-theoretic
setting of Section 2.1 and assume, for simplicity, that the point forecast, the observation and
the target functional take values in $D = \mathbb{R}^d$.

We first discuss the mean functional. Assuming that $S(x, y) \ge 0$ with equality if $x = y$,
Savage (1971), Osband and Reichelstein (1985) and Banerjee et al. (2005) showed that a
scoring function under which the (component-wise) expectation of the predictive distribution
is an optimal point forecast, is of the Bregman form

$$S(x, y) = \phi(y) - \phi(x) - \langle \nabla \phi(x), y - x \rangle, \tag{30}$$

where $\phi : \mathbb{R}^d \to \mathbb{R}$ is convex with gradient $\nabla \phi : \mathbb{R}^d \to \mathbb{R}^d$ and $\langle \cdot, \cdot \rangle$ denotes a scalar
product, subject to smoothness conditions. Expressed differently, a sufficiently smooth scoring
function is consistent for the mean functional if and only if it is of the form (30), which is
a generalization of the Bregman representation (18) in the case of a univariate predictand.
When $\phi(x) = \|x\|^2$ is the squared Euclidean norm, we obtain the squared error scoring
function, and similarly its ramifications, such as the weighted squared error and the pseudo
Mahalanobis error (Laurent et al. 2009).

It is of interest to note that rigorous versions of the Bregman characterization depend on 
restrictive smoothness conditions. Osband and Reichelstein (1985) assume that the scoring 
function is continuously differentiable with respect to its first argument, the point forecast; 
Banerjee et al. (2005) assume the existence of continuous second partial derivatives with 
respect to the observation. A challenging, nontrivial problem is to unify and strengthen 
these results, both in univariate and multivariate settings. 

Laurent et al. (2009) consider point forecasts of multivariate stochastic volatility, where the
predictand is a symmetric and positive definite matrix in $\mathbb{R}^{q \times q}$. If the matrix is vectorized, the
above results for the mean functional apply, thereby leading to the Bregman representation
(30) for the respective consistent scoring functions, which is hidden in Proposition 3 of
Laurent et al. (2009). Corollary 1 of Laurent et al. (2009) supplies a version thereof that
applies directly to point forecasts, say $\Sigma_x \in \mathbb{R}^{q \times q}$, of a matrix-valued, symmetric and positive
definite quantity, say $\Sigma_y \in \mathbb{R}^{q \times q}$, without any need to resort to vectorization. Specifically,
any scoring function of the form

$$S(\Sigma_x, \Sigma_y) = \phi(\Sigma_y) - \phi(\Sigma_x) - \mathrm{tr}\left( \nabla_0 \phi(\Sigma_x)\, (\Sigma_y - \Sigma_x) \right) \tag{31}$$

is consistent for the (component-wise) mean functional, where $\phi$ is convex and smooth, and
$\nabla_0 \phi$ denotes a symmetric matrix of first partial derivatives, with the off-diagonal elements
multiplied by a factor of one half.

Dawid and Sebastiani (1999) and Pukelsheim (2006) give various examples of convex
functions whose domain is the cone of the symmetric and positive definite elements of $\mathbb{R}^{q \times q}$,
with the matrix norm

$$\phi(\Sigma) = \left( \frac{1}{q}\, \mathrm{tr}(\Sigma^s) \right)^{1/s} \tag{32}$$

for $s > 1$ being one such instance. The matrix norm is nonnegative, nondecreasing in
the Loewner order, continuous, strictly convex, standardized and homogeneous of order
one. With simple adaptations, the construction extends to any real or extended real-valued
exponent $s$ and to general, not necessarily positive definite symmetric matrices (Pukelsheim
2006, pp. 141 and 151). In the limit as $s \to 0$ in (32) the log determinant $\phi(\Sigma) = \log \det(\Sigma)$
emerges. When used in the Bregman representation (31), the log determinant function gives
rise to a well known homogeneous scoring function for point predictions of a positive definite,
symmetric matrix-valued quantity in $\mathbb{R}^{q \times q}$, namely,

$$S(\Sigma_x, \Sigma_y) = \mathrm{tr}(\Sigma_x^{-1} \Sigma_y) - \log \det(\Sigma_x^{-1} \Sigma_y) - q, \tag{33}$$

which was introduced by James and Stein (1961, Section 5). When $q = 1$ the scoring function
(33) reduces to the Patton function (20) with $b = 0$, that is, the QLIKE function.
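
The connection between the James and Stein score (33) and the QLIKE function can be checked directly. The sketch below restricts attention to 2 x 2 matrices and uses hand-written helper functions in place of a linear algebra library; the matrices and all names are hypothetical. For diagonal matrices the score (33) splits into a sum of univariate QLIKE scores, which is the q = 1 reduction applied to each diagonal entry.

```python
import math


def inv2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(m, n):
    """Product of two 2 x 2 matrices."""
    return [[sum(m[i][k] * n[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def james_stein_score(sigma_x, sigma_y):
    """Score (33) with q = 2 for symmetric positive definite matrices."""
    ratio = matmul2(inv2(sigma_x), sigma_y)
    trace = ratio[0][0] + ratio[1][1]
    det = ratio[0][0] * ratio[1][1] - ratio[0][1] * ratio[1][0]
    return trace - math.log(det) - 2.0


def qlike(x, y):
    """Patton function (20) with b = 0."""
    return y / x - math.log(y / x) - 1.0


A = [[2.0, 0.5], [0.5, 1.0]]
B = [[1.0, 0.2], [0.2, 3.0]]
diag_x = [[2.0, 0.0], [0.0, 0.5]]
diag_y = [[1.0, 0.0], [0.0, 4.0]]

print(james_stein_score(A, A), james_stein_score(A, B))
print(james_stein_score(diag_x, diag_y), qlike(2.0, 1.0) + qlike(0.5, 4.0))
```

The score vanishes when the forecast equals the realization and is positive otherwise, as expected of a Bregman-type scoring function.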

In the case of quantiles, the passage from the univariate functional to multivariate analogues 
is much less straightforward. Notions of quantiles for multivariate distributions based on 
loss or scoring functions have been studied by Abdous and Theodorescu (1992), Chaudhuri 
(1996), Koltchinskii (1997), Serfling (2002) and Hallin, Paindaveine and Siman (2010), among 
others. In particular, it is customary to define the median of a probability distribution $F$ on
$\mathbb{R}^d$ as

$$\hat{x} = \arg\min_x\, \mathbb{E}_F \left( \|x - Y\| - \|Y\| \right),$$

where $\| \cdot \|$ denotes the Euclidean norm (Small 1990). If $d = 1$, this yields the traditional
median on the real line, with the term $\|Y\|$ eliminating the need for moment conditions on
the predictive distribution (Kemperman 1987). Of course, norms and distances other than
the Euclidean could be considered. In this more general type of situation, Koenker (2006)
proposed that a functional based on minimizing the square of a distance be called a Fréchet
mean, and a functional based on minimizing a distance a Fréchet median, just as in the
traditional case of the Euclidean distance.

5 Discussion 

Ideally, forecasts ought to be probabilistic, taking the form of predictive distributions over 
future quantities and events (Dawid 1984; Diebold et al. 1998; Granger and Pesaran 2000a, 
2000b; Gneiting 2008a). If point forecasts are to be issued and evaluated, it is essential that 
either the scoring function be specified ex ante, or an elicitable target functional be named, 
such as the mean or a quantile of the predictive distribution, and scoring functions be used 
that are consistent for the target functional. 

Our plea for the use of consistent scoring functions supplements and qualifies, but does not
contradict, extant recommendations in the forecasting literature, such as those of Armstrong 
(2001), Jolliffe and Stephenson (2003) and Fildes and Goodwin (2007). For example, Fildes 
and Goodwin (2007) propose forecasting principles for organizations, the eleventh of which 
suggests that "multiple measures of forecast accuracy" be employed. I agree, with the 
qualification that the scoring functions to be used be consistent for the target functional. 

We have developed theory for the notions of consistency and elicitability, and have char- 
acterized the classes of the loss or scoring functions that result in expectations, ratios of 
expectations, quantiles or expectiles as optimal point forecasts. Some of these results are 
classical, such as those for means and quantiles (Savage 1971; Thomson 1979), while others 
are original, including a disconcerting negative result, in that scoring functions which are 
consistent for the CVaR functional do not exist. 

In the case of the mean functional, the consistent scoring functions are the Bregman functions
of the form (18). Among these, a particularly attractive choice is the Patton family (20) of
homogeneous scoring functions, which nests the squared error (SE) and QLIKE functions.
In evaluating volatility forecasts, Patton and Sheppard (2009) recommend the use of the
latter because of its superior power in Diebold and Mariano (1995) and West (1996) tests of
predictive ability, which depend on differences between mean scores of the form (1) as test
statistics. Further work in this direction is desirable, both empirically and theoretically. If
quantile forecasts are to be assessed, the consistent scoring functions are the GPL functions
of the form (25), with the homogeneous power functions in (26) being appealing examples.
Interestingly, the scoring functions that are consistent for expectiles combine key elements 
of the Bregman and GPL families. 
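
As a rough numerical illustration of these two families, the following Python sketch builds Bregman-type scores S(x, y) = φ(y) − φ(x) − φ'(x)(y − x) and GPL-type scores S(x, y) = (1{x ≥ y} − α)(g(x) − g(y)), which is how we read the representations referred to above. It then checks on a simulated sample that different members of the same family lead to essentially the same optimal point forecast. The helper names, the shifted exponential sample, the grid and the level α = 0.9 are assumptions made purely for illustration.

```python
import math
import random


def bregman(phi, dphi):
    """Bregman-type score S(x, y) = phi(y) - phi(x) - phi'(x) * (y - x)."""
    return lambda x, y: phi(y) - phi(x) - dphi(x) * (y - x)


def gpl(g, alpha):
    """GPL-type score S(x, y) = (1{x >= y} - alpha) * (g(x) - g(y))."""
    return lambda x, y: ((1.0 if x >= y else 0.0) - alpha) * (g(x) - g(y))


def mean_score(score, x, sample):
    """Average score of the point forecast x over the sample."""
    return sum(score(x, y) for y in sample) / len(sample)


def argmin_on_grid(score, sample, grid):
    """Grid point with the smallest mean score (a crude optimizer)."""
    return min(grid, key=lambda x: mean_score(score, x, sample))


random.seed(0)
# Hypothetical data: a skewed, strictly positive sample.
sample = [0.1 + random.expovariate(1.0) for _ in range(2000)]
grid = [0.05 * k for k in range(1, 81)]  # 0.05, 0.10, ..., 4.00

# Two Bregman members: squared error and QLIKE, y/x - log(y/x) - 1.
squared_error = bregman(lambda t: t * t, lambda t: 2.0 * t)
qlike = bregman(lambda t: -math.log(t), lambda t: -1.0 / t)

# Two GPL members at level alpha = 0.9: g(t) = t and g(t) = log(t).
pinball = gpl(lambda t: t, 0.9)
log_gpl = gpl(math.log, 0.9)

best_se = argmin_on_grid(squared_error, sample, grid)
best_qlike = argmin_on_grid(qlike, sample, grid)
best_pinball = argmin_on_grid(pinball, sample, grid)
best_log_gpl = argmin_on_grid(log_gpl, sample, grid)

print("sample mean:", sum(sample) / len(sample))
print("SE / QLIKE minimizers:", best_se, best_qlike)
print("GPL minimizers (alpha = 0.9):", best_pinball, best_log_gpl)
```

Up to the grid resolution, both Bregman scores should single out the sample mean and both GPL scores the empirical 0.9-quantile, even though the scores themselves penalize errors quite differently.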

As regards the most commonly used scoring functions in academia, businesses and organi- 
zations, the squared error scoring function is consistent for the mean, and the absolute error 
scoring function for the median. The absolute percentage error scoring function, which is 
commonly used by businesses and organizations, and occasionally in academia, is consistent 
for a non-standard functional, namely, the median of order $-1$, $\mathrm{med}^{(-1)}$, which tends to support
severe underforecasts, as compared to the mean or median. It thus seems prudent that
businesses and organizations consider the intended or unintended consequences and reassess 
its suitability as a scoring function. 
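
The tendency towards underforecasts can be seen in a small simulation. The sketch below, in Python, draws a hypothetical lognormal sample and computes the point forecasts that minimize the mean squared error, the mean absolute error and the mean absolute percentage error over that sample. The minimizer of the mean APE is obtained as a weighted median with weights 1/y, which is our own computational shortcut and not a procedure taken from the text; the sample size, seed and distribution are likewise assumptions.

```python
import random
import statistics


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total.

    This value minimizes sum_i w_i * |x - y_i| over x.
    """
    pairs = sorted(zip(values, weights))
    half = 0.5 * sum(weights)
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    return pairs[-1][0]


def mean_ape(x, sample):
    """Mean absolute percentage error of the point forecast x."""
    return sum(abs(x - y) / y for y in sample) / len(sample)


random.seed(1)
# Hypothetical positive, right-skewed observations.
sample = [random.lognormvariate(0.0, 1.0) for _ in range(5000)]

best_se = sum(sample) / len(sample)      # minimizes mean squared error
best_ae = statistics.median(sample)      # minimizes mean absolute error
best_ape = weighted_median(sample, [1.0 / y for y in sample])  # minimizes mean APE

print("optimal under SE :", best_se)
print("optimal under AE :", best_ae)
print("optimal under APE:", best_ape)
```

In this setting the APE-optimal forecast should fall well below both the median and the mean of the sample, in line with the discussion above.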

Pers et al. (2009) propose a game of prediction for a fair comparison between competing 
predictive models, which employs proper scoring rules. As Theorem 2.4 shows, consistent
scoring functions can be interpreted as proper scoring rules. Hence, the protocol of Pers et 
al. (2009) applies directly to the evaluation of point forecasting methods. Their focus is on 
the comparison of custom-built predictive models for a specific purpose, as opposed to the 
M-competitions in the forecasting literature (Makridakis and Hibon 1979, 2000; Makridakis 
et al. 1982, 1993), which compare the predictive performance of point forecasting methods 
across multiple, unrelated time series. In this latter context, additional considerations arise, 
such as the comparability of scores across time series with realizations of differing magnitude 
and volatility, and commonly used evaluation methods remain controversial (Armstrong and
Collopy 1992; Fildes 1992; Ahlburg et al. 1992; Hyndman and Koehler 2006).

The notions of consistency and elicitability apply to point forecast competitions, where 
participants ought to be advised ex ante about the scoring function(s) to be employed, 
or, alternatively, target functional(s) ought to be named. If multiple target functionals 
are named, participants can enter possibly distinct point forecasts for distinct functionals. 
Similarly, if multiple scoring functions are to be used in the evaluation, and the scoring 
functions are consistent for distinct functionals, participants ought to be allowed to submit 
possibly distinct point forecasts. 

While thus far we have addressed forecasting or prediction problems, similar issues arise 
when the goal is estimation. Technically, our discussion relates to M-estimation (Huber 
1964; Huber and Ronchetti 2009). A century ago Keynes (1911, p. 325) derived the Bregman
representation (18) in characterizing the probability density functions for which the
"most probable value" is the arithmetic mean. For a contemporary perspective in terms
of maximum likelihood and M-estimation, see Klein and Grottke (2008). Komunjer (2005)
applied the GPL class (25) in conditional quantile estimation, in generalization of the traditional
approach to quantile regression, which is based on the asymmetric piecewise linear
scoring function (Koenker and Bassett 1978). Similarly, Bregman functions of the original
form (18) and of the variant in (28) could be employed in generalizing symmetric and
asymmetric least squares regression. 

In applied settings, the distinction between prediction and estimation is frequently blurred. 
For example, Shipp and Cohen (2009) report on U.S. Census Bureau plans for evaluating 
population estimates against the results of the 2010 Census. Five measures of accuracy are 
to be used to assess the Census Bureau estimates, including the root mean squared error 
(SE) and the mean absolute percentage error (APE). Our results demonstrate that Census 
Bureau scientists face an impossible task in designing procedures and point estimates aimed 
at minimizing both measures simultaneously, because the SE and the APE are consistent for 
distinct statistical functionals. In this light, it may be desirable for administrative or political 
leadership to provide a directive or target functional to Census Bureau scientists, much in 
the way that Murphy and Daan (1985) and Engelberg et al. (2009) requested guidance for 
point forecasters, in the quotes that open and motivate this paper. 



Appendix A: Proofs 



Proof of Theorem 2.3. Given $F \in \mathcal{F}$, let $t \in T(F)$ and $x \in D$. Then

\[
\mathbb{E}_F S(t, Y) = \int \mathbb{E}_F S_\omega(t, Y) \, \lambda(d\omega) \le \int \mathbb{E}_F S_\omega(x, Y) \, \lambda(d\omega) = \mathbb{E}_F S(x, Y),
\]

where the interchange of the expectation and the integration is allowable, because each $S_\omega$
is a nonnegative scoring function. □



Proof of Theorem 2.4. Given any two probability measures $F, G \in \mathcal{F}$, we have

\[
\mathbb{E}_F S(F, Y) = \mathbb{E}_F S(T(F), Y) \le \mathbb{E}_F S(T(G), Y) = \mathbb{E}_F S(G, Y),
\]

where the expectations are well-defined, because the scoring function $S$ is nonnegative. □

Proof of Theorem 2.6. We first show part (b). Towards this end, let $t_g \in T_g(F)$ and $x_g \in D$.
Then $t_g = g(t)$ for some $t \in T(F)$ and $x_g = g(x)$ for some $x \in D$. Therefore,

\[
\mathbb{E}_F S_g(t_g, Y) = \mathbb{E}_F S(t, Y) \le \mathbb{E}_F S(x, Y) = \mathbb{E}_F S_g(x_g, Y).
\]

As regards parts (c) and (a), it suffices to note that if $S$ is strictly consistent, we have equality
if and only if $x \in T(F)$ or, equivalently, $x_g \in T_g(F)$. □



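Proof of Theorem 2.7. We first prove part (b). Let $F \in \mathcal{F}^{(w)}$, $t \in T^{(w)}(F)$ and $x \in D$. Then

\[
\begin{aligned}
\mathbb{E}_F S^{(w)}(t, Y) &= \mathbb{E}_F [w(Y) S(t, Y)] \\
&= \int S(t, y) \, w(y) f(y) \, \mu(dy) \\
&= \int w(y) f(y) \, \mu(dy) \int S(t, y) \, dF^{(w)}(y) \\
&\le \int w(y) f(y) \, \mu(dy) \int S(x, y) \, dF^{(w)}(y) \\
&= \mathbb{E}_F [w(Y) S(x, Y)] \\
&= \mathbb{E}_F S^{(w)}(x, Y),
\end{aligned}
\]

where $\mu$ is a dominating measure. The critical inequality holds because $F^{(w)} \in \mathcal{F}^{(w)} \subseteq \mathcal{F}$
and $t \in T^{(w)}(F) = T(F^{(w)})$. To prove parts (c) and (a), we note that the inequality is
strict if $S$ is strictly consistent for $T$, unless $x \in T(F^{(w)}) = T^{(w)}(F)$. □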
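
The statement can be checked by hand on a small example. The following Python sketch uses a hypothetical three-point distribution, the squared error as the original scoring function and the weight function w(y) = y, all of which are assumptions chosen for illustration. It verifies in exact rational arithmetic that the point forecast minimizing the expected weighted score coincides with the mean of the reweighted distribution, whose probabilities are proportional to w(y) times the original ones.

```python
from fractions import Fraction

# Hypothetical finite-support distribution F: atoms and their probabilities.
atoms = [Fraction(1), Fraction(2), Fraction(4)]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]


def weight(y):
    """Weight function w(y) = y (an assumption for this example)."""
    return y


def reweight(atoms, probs, w):
    """Probabilities of the reweighted distribution, proportional to w(y) * p(y)."""
    raw = [w(y) * p for y, p in zip(atoms, probs)]
    total = sum(raw)
    return [r / total for r in raw]


def expected_weighted_se(x, atoms, probs, w):
    """Expected weighted squared error E_F[w(Y) * (x - Y)^2] under F."""
    return sum(p * w(y) * (x - y) ** 2 for y, p in zip(atoms, probs))


probs_w = reweight(atoms, probs, weight)
mean_original = sum(y * p for y, p in zip(atoms, probs))
mean_reweighted = sum(y * p for y, p in zip(atoms, probs_w))

# Compare against a grid of alternative point forecasts between 1 and 4.
candidates = [Fraction(k, 8) for k in range(8, 33)]
best = min(candidates, key=lambda x: expected_weighted_se(x, atoms, probs, weight))

print("mean under F:", mean_original)
print("mean under reweighted F:", mean_reweighted)
print("minimizer of expected weighted squared error:", best)
```

Here the mean under F is 2, whereas the weighted squared error is minimized at 11/4, the mean of the reweighted distribution.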



Proof of Theorem 2.8. Suppose that the functional $T$ is elicitable relative to the class $\mathcal{F}$
on the domain $D$. Then there exists a scoring function $S$ which is strictly consistent for it
relative to $\mathcal{F}$. Suppose now that $F_0 \in \mathcal{F}$, $F_1 \in \mathcal{F}$ and $t \in D$ are such that $t \in T(F_0)$ and
$t \in T(F_1)$. If $x \in D$ is arbitrary and $p \in (0, 1)$ is such that $F_p = (1 - p) F_0 + p F_1 \in \mathcal{F}$, then

\[
\begin{aligned}
\mathbb{E}_{F_p} S(t, Y) &= (1 - p) \, \mathbb{E}_{F_0} S(t, Y) + p \, \mathbb{E}_{F_1} S(t, Y) \\
&\le (1 - p) \, \mathbb{E}_{F_0} S(x, Y) + p \, \mathbb{E}_{F_1} S(x, Y) = \mathbb{E}_{F_p} S(x, Y).
\end{aligned}
\]

Hence, $t \in T(F_p)$. □



Sketch of the proof of Theorem 3.1. The statements in parts (b) and (c) are immediate from
the arguments in Section 6.3 of Savage (1971), and form special cases of the more general
result in Theorem 3.2. To prove the necessity of the representation (18), Savage essentially
applied Osband's principle with the identification function $V(x, y) = x - y$. □



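Proof of Theorem 3.2. We first prove part (b). To show the sufficiency of the representation
(22), let $x \in I$ and let $F$ be a probability measure on $I$ for which $\mathbb{E}_F[r(Y)]$, $\mathbb{E}_F[s(Y)]$,
$\mathbb{E}_F[r(Y)\phi'(Y)]$, $\mathbb{E}_F[s(Y)\phi(Y)]$ and $\mathbb{E}_F[Y s(Y)\phi'(Y)]$ exist and are finite. Then

\[
\mathbb{E}_F S(x, Y) - \mathbb{E}_F S\!\left(\frac{\mathbb{E}_F[r(Y)]}{\mathbb{E}_F[s(Y)]}, Y\right)
= \mathbb{E}_F[s(Y)] \left( \phi\!\left(\frac{\mathbb{E}_F[r(Y)]}{\mathbb{E}_F[s(Y)]}\right) - \phi(x) - \phi'(x) \left(\frac{\mathbb{E}_F[r(Y)]}{\mathbb{E}_F[s(Y)]} - x\right) \right)
\]

is nonnegative, and is strictly positive if $\phi$ is strictly convex and $x \neq \mathbb{E}_F[r(Y)] / \mathbb{E}_F[s(Y)]$.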

As regards part (c), it remains to show the necessity of the representation (22). We apply
Osband's principle with the identification function $V(x, y) = x s(y) - r(y)$, as proposed by
Osband (1985, p. 14). Arguing in the same way as in Section 2.4, we see that

\[
S_{(1)}(x, a) / (x s(a) - r(a)) = S_{(1)}(x, b) / (x s(b) - r(b))
\]

for all pairwise distinct $a$, $b$ and $x \in I$. Hence,

\[
S_{(1)}(x, y) = h(x) \, (x s(y) - r(y))
\]

for $x, y \in I$ and some function $h : I \to I$. Partial integration yields the representation (22),
where

\[
\phi(x) = \int_{x_0}^{x} \int_{x_0}^{s} h(u) \, du \, ds \qquad (34)
\]

for some $x_0 \in I$. Finally, $\phi$ is convex, because the scoring function $S$ is nonnegative, which
implies the validity of the subgradient inequality.

To prove part (a), we consider the scoring function (22) with $\phi(y) = y^2 / (1 + |y|)$, for which
the expectations in part (b) exist and are finite if, and only if, $\mathbb{E}_F[r(Y)]$, $\mathbb{E}_F[s(Y)]$ and
$\mathbb{E}_F[Y s(Y)]$ exist and are finite. □



Sketch of the proof of Theorem 3.3. For concise yet full-fledged proofs of parts (b) and (c),
see Gneiting (2008b), where Osband's principle is applied with the identification function
$V(x, y) = \mathbb{1}(x \ge y) - \alpha$. To prove part (a), we may apply part (c) with any strictly increasing,
bounded function $g : I \to I$, with $g(x) = \exp(x) / (1 + \exp(x))$ being one such example. □

Proof of Theorem 3.4. To show the sufficiency of the representation (28), let $x \in I$ where
$x < \mu_\tau$, and let $F$ be a probability measure with compact support in $I$. A tedious but
straightforward calculation shows that if $S$ is of the form (28) then

\[
\begin{aligned}
\mathbb{E}_F S(x, Y) - \mathbb{E}_F S(\mu_\tau, Y)
&= (1 - \tau) \int_{(-\infty, x)} \left( \phi(\mu_\tau) - \phi(x) - \phi'(x)(\mu_\tau - x) \right) dF(y) \\
&\quad + \tau \int_{[x, \mu_\tau)} \left( \phi(y) - \phi(x) - \phi'(x)(y - x) \right) dF(y) \\
&\quad + \tau \int_{[\mu_\tau, \infty)} \left( \phi(\mu_\tau) - \phi(x) - \phi'(x)(\mu_\tau - x) \right) dF(y) \\
&\quad + (1 - \tau) \int_{[x, \mu_\tau)} \underbrace{\left( \phi(\mu_\tau) - \phi(y) - \phi'(x)(\mu_\tau - y) \right)}_{\ge \, (\phi'(y) - \phi'(x))(\mu_\tau - y) \, \ge \, 0} dF(y)
\end{aligned}
\]

is nonnegative, and is strictly positive if $\phi$ is strictly convex. An analogous argument applies
when $x > \mu_\tau$. This proves sufficiency in part (b) as well as the claim in part (c).

To prove the necessity of the representation (28) in part (b), we apply Osband's principle
with the identification function $V(x, y) = |\mathbb{1}(x \ge y) - \tau| \, (x - y)$. Arguing in the usual way,
we see that

\[
S_{(1)}(x, y) = h(x) \, V(x, y)
\]

for $x, y \in I$ and some function $h : I \to I$. Partial integration yields the representation (28),
where $\phi$ is defined as in (34) and is convex, because $S$ is nonnegative.

To prove part (a), we apply part (c) with the convex function $\phi(y) = y^2 / (1 + |y|)$, for which
$\mathbb{E}_F \phi(Y)$ exists and is finite if, and only if, $\mathbb{E}_F Y$ exists and is finite. □

Proof of Theorem 3.5. Suppose first that $\mathcal{F}$ contains the measures with finite support. Let
$a, b, c, d \in I$ be such that $a < b < c < \frac{1}{2}(b + d)$, which implies $b < d$, and consider the
probability measures

\[
F_1 = \alpha \delta_a + \tfrac{1}{2}(1 - \alpha)(\delta_b + \delta_d), \qquad F_2 = \alpha \delta_c + (1 - \alpha) \delta_{(b+d)/2},
\]

where $\delta_x$ denotes the point measure in $x \in \mathbb{R}$. Then $\mathrm{CVaR}_\alpha(F_1) = \mathrm{CVaR}_\alpha(F_2) = \frac{1}{2}(b + d)$,
while $\mathrm{CVaR}_\alpha(\frac{1}{2}(F_1 + F_2)) = \frac{1}{4}(b + c + 2d) > \frac{1}{2}(b + d)$. Thus, the level sets of the functional
are not convex. By Theorem 2.8, the CVaR functional is not elicitable relative to the class
$\mathcal{F}$. An analogous example emerges when the point measures are replaced by appropriately
focused and centered absolutely continuous distributions with compact support. □
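
The counterexample is easy to reproduce numerically. The Python sketch below assumes that CVaR at level α is the mean of the upper 1 − α portion of the probability mass, which is the convention under which the values stated in the proof come out. The particular choice α = 1/2 and (a, b, c, d) = (0, 1, 2, 5), as well as the function names, are hypothetical.

```python
from fractions import Fraction


def upper_tail_mean(dist, alpha):
    """Mean of the upper (1 - alpha) probability mass of a finite-support law.

    `dist` maps atoms to probabilities. Treating CVaR as this upper-tail mean
    is an assumption made for the purposes of this example.
    """
    need = 1 - alpha
    remaining = need
    total = Fraction(0)
    for y in sorted(dist, reverse=True):
        take = min(dist[y], remaining)
        total += take * y
        remaining -= take
        if remaining == 0:
            break
    return total / need


def mix(f, g):
    """Equal-weight mixture (F + G) / 2 of two finite-support laws."""
    out = {}
    for dist in (f, g):
        for y, p in dist.items():
            out[y] = out.get(y, Fraction(0)) + p / 2
    return out


alpha = Fraction(1, 2)
a, b, c, d = Fraction(0), Fraction(1), Fraction(2), Fraction(5)
assert a < b < c < (b + d) / 2  # the ordering required in the proof

f1 = {a: alpha, b: (1 - alpha) / 2, d: (1 - alpha) / 2}
f2 = {c: alpha, (b + d) / 2: 1 - alpha}

cvar_f1 = upper_tail_mean(f1, alpha)
cvar_f2 = upper_tail_mean(f2, alpha)
cvar_mix = upper_tail_mean(mix(f1, f2), alpha)

print("CVaR(F1) =", cvar_f1)
print("CVaR(F2) =", cvar_f2)
print("CVaR((F1 + F2) / 2) =", cvar_mix)
```

With these values both component measures have CVaR equal to (b + d)/2 = 3, while the mixture has CVaR equal to (b + c + 2d)/4 = 13/4, so the level set is not convex.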

Appendix B: Optimal point forecasts under the relative
error scoring function (Table 8)



Here we address a problem posited by Patton (2010), in that we find the optimal point
forecast or Bayes rule

\[
\hat{x} = \arg\min_{x} \, \mathbb{E}_F \left| \frac{x - Y}{x} \right| \qquad (35)
\]

where $Y = Z^2$ and $Z$ has a t-distribution with mean 0, variance 1 and $\nu > 2$ degrees of
freedom. In the limiting case as $\nu \to \infty$, we take $Z$ to be standard normal.

To find the optimal point forecast, we apply Theorem 2.2 and part (b) of Theorem 2.7 with
the original scoring function $S(x, y) = |x^{-1} - y^{-1}|$, the weight function $w(y) = y$ and the
domain $D = (0, \infty)$, so that $S^{(w)}(x, y) = |(x - y)/x|$. By Theorem 3.3, the scoring function $S$
is consistent for the median functional. Therefore, by Theorem 2.7, the optimal point forecast
under the weighted scoring function is the median of the probability distribution whose
density is proportional to $y f(y)$, where $f$ is the density of $Y$, or equivalently, proportional
to $y^{1/2} g(y^{1/2})$, where $g$ is the density of $Z$.

Hence, if $Z$ has a t-distribution with mean 0, variance 1 and $\nu > 2$ degrees of freedom,
the optimal point forecast under the relative error scoring function is the median of the
probability distribution whose density is proportional to

\[
y^{1/2} \left( 1 + \frac{y}{\nu - 2} \right)^{-(\nu + 1)/2}
\]

on the positive halfaxis. Using any computer algebra system, this median can readily be
computed symbolically or numerically, to any desired degree of accuracy. For example, if
$\nu = 4$ the optimal point forecast (35) is

\[
\hat{x} = \frac{2}{2^{2/3} - 1} \approx 3.4048.
\]


Table 8 provides numerical values along with the approximations in Table 1 of Patton (2010),
which were obtained by Monte Carlo methods, and thus are less accurate. If $Z$ has variance
$\sigma^2$, the entries in the table continue to apply, if they are multiplied by this constant.
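
As a minimal numerical sketch of this computation, the Python code below finds the median of the density proportional to y^{1/2} (1 + y/(ν − 2))^{−(ν+1)/2} on the positive halfaxis, which is our reading of y^{1/2} g(y^{1/2}) for a t density g with unit variance. The substitution t = u/(1 + u) with u = y/(ν − 2), which turns the density into the Beta-type kernel t^{1/2} (1 − t)^{(ν−4)/2} on (0, 1), is our own device and not taken from the text, as are the function name, the midpoint rule and the restriction to ν ≥ 4, where the kernel is bounded.

```python
def optimal_relative_error_forecast(nu, n=20000):
    """Median of the density proportional to sqrt(y) * (1 + y/(nu-2))^(-(nu+1)/2).

    Assumes nu >= 4. With u = y / (nu - 2) and t = u / (1 + u), the density
    becomes proportional to t^(1/2) * (1 - t)^((nu - 4)/2) on (0, 1); the
    median in t is located with a midpoint rule and mapped back to y.
    """
    b_exp = 0.5 * (nu - 4.0)
    h = 1.0 / n
    # Mass of each cell under the (unnormalized) kernel, midpoint rule.
    cells = [((k + 0.5) * h) ** 0.5 * (1.0 - (k + 0.5) * h) ** b_exp * h
             for k in range(n)]
    half = 0.5 * sum(cells)
    running = 0.0
    for k, cell in enumerate(cells):
        if running + cell >= half:
            # Interpolate linearly within the cell that contains the median.
            t_med = k * h + h * (half - running) / cell
            return (nu - 2.0) * t_med / (1.0 - t_med)
        running += cell
    raise RuntimeError("median not found")


# For nu = 4 the kernel is t^(1/2), so the median solves t^(3/2) = 1/2.
closed_form_nu4 = 2.0 / (2.0 ** (2.0 / 3.0) - 1.0)
numeric_nu4 = optimal_relative_error_forecast(4.0)

print("nu = 4, closed form:", closed_form_nu4)
print("nu = 4, numerical  :", numeric_nu4)
print("nu = 10, numerical :", optimal_relative_error_forecast(10.0))
```

For ν = 4 the numerical median should agree with the closed-form value 2/(2^{2/3} − 1) up to the discretization error of the midpoint rule.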